Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
123 changes: 123 additions & 0 deletions skills/cecil-data-analysis-xql/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,123 @@
---
name: cecil-data-analysis-xql
description: Use this skill when the user asks an analysis question about Cecil earth-observation data — forest carbon, biomass, deforestation, land cover, soil moisture, biodiversity, protected areas, ecosystem integrity. The skill writes SQL with xarray-sql (xql) against the user's existing subscription and presents query → result table → plain-English interpretation as one compact block.
license: MIT
---

# Cecil data analysis with xarray-sql

Use this skill for questions like *"how has tree cover changed in my AOI?"* or *"what's the dominant land cover here?"* — anything that wants to **query** an existing Cecil subscription with SQL.

For setup, installation, and the runnable scripts that drive these queries, see [`tutorials/cecil-data-analysis-xql/`](../../tutorials/cecil-data-analysis-xql/). This SKILL.md covers only what an agent needs to know in-context.

## Related skills

- [`subscribe-and-load`](../subscribe-and-load/SKILL.md) — sets up the subscription + AOI that the queries below assume.
- [`land-cover-baseline-and-change`](../land-cover-baseline-and-change/SKILL.md) — the hand-written numpy/xarray counterpart for readers who want to skip the SQL layer.

## When this skill is in scope

**Yes:** dominant-class, time-series, change-detection, area, or overlap questions on Cecil rasters (biomass, LULC, soil moisture, BII, FLII) or vector datasets (IBAT family — protected areas, KBA, species ranges).

**No:** weather forecasts, atmospheric chemistry / GHGs, ocean / marine data, generic Python help, anything without a spatial AOI. Cecil doesn't carry that data — say so plainly and stop. Don't force-fit a dataset.

## Subscription gate (the rule that protects the wallet)

**Default: read-only on existing subscriptions.** Run `scripts/list_subscriptions.py` first. If a subscription for the chosen dataset already exists, go straight to the SQL.

If no matching subscription exists:

1. Summarise what you'd subscribe to: dataset name, provider, coverage, relevant variables.
2. Show the exact calls you'd make: `client.create_aoi(geometry=...)` (if no AOI) and `client.create_subscription(aoi_id=..., dataset_id=...)`.
3. **Stop and ask the user to confirm** before calling either method. Subscriptions are billed. Get explicit confirmation for *this* dataset *now*, even if the user agreed to a different one earlier in the conversation.

If the user declines, say so and stop. Don't try to find a free substitute that might silently answer a different question.

## Golden rule: query → table → insight

Whenever this skill is in scope, present every analysis as one compact block, in this order:

1. **SQL** — fenced `sql` block. Applies at every stage: `DESCRIBE`, exploratory queries, final analysis.
2. **Result table** — Markdown, **capped at 10 rows**. If longer, end with *"… N more in result.csv"*. Never dump 200 rows inline.
3. **Interpretation** — two or three sentences of plain-English insight. No header, no "here are the results:" preamble.

For exploratory queries with nothing to interpret (`DESCRIBE`, etc.), just show 1 + 2.

## Table naming

The xarray Dataset is registered under a name derived from the dataset (provider prefix stripped, slugified):

| Dataset name | Table name |
|---|---|
| Impact Observatory — Land Cover 9-Class | `land_cover_9_class` |
| Chloris — Aboveground Biomass Stock and Change 10m | `aboveground_biomass_stock_and_change_10m` |
| Lobelia Earth — Surface Soil Moisture 30m | `surface_soil_moisture_30m` |

Use the same name in your SQL (`FROM land_cover_9_class`, not `FROM cecil`). `scripts/run_analysis.py` does the derivation automatically and prints the table name.

## SQL idioms

**Always start with `DESCRIBE`** to confirm column names before writing real SQL:

```sql
DESCRIBE land_cover_9_class;
```

**Top class per timestamp** — `ROW_NUMBER OVER`:

```sql
SELECT time, ref."Name", cnt
FROM (
SELECT time, land_cover_class, COUNT(*) AS cnt,
ROW_NUMBER() OVER (PARTITION BY time ORDER BY COUNT(*) DESC) AS rn
FROM land_cover_9_class
GROUP BY time, land_cover_class
) r
JOIN ref_land_cover_class ref ON r.land_cover_class = ref."Index"
WHERE rn = 1
ORDER BY time;
```

**Change detection** — `LAG OVER` per pixel:

```sql
SELECT x, y, time,
land_cover_class AS new_class,
LAG(land_cover_class) OVER (PARTITION BY x, y ORDER BY time) AS prev_class
FROM land_cover_9_class;
```

Filter with `WHERE prev_class IS DISTINCT FROM new_class AND prev_class IS NOT NULL` to keep only changed pixels.

**Time-series aggregation** — plain `GROUP BY time`:

```sql
SELECT time, AVG(aboveground_biomass) AS mean_biomass
FROM aboveground_biomass_stock_and_change_10m
WHERE aboveground_biomass IS NOT NULL
GROUP BY time
ORDER BY time;
```

## Quoting rule (datafusion gotcha)

Reference-table columns from the API are **PascalCase** (`"Index"`, `"Name"`). **Double-quote them in SQL** — datafusion lower-cases unquoted identifiers and you'll see *"Schema error: no field named index"*. Get this right and joins work.

## Vector vs raster datasets

Most Cecil datasets are raster (`load_xarray`). The **IBAT family** (IUCN Red List, KBA, WDPA, STAR) is vector — use `load_dataframe`. `run_analysis.py` accepts `--vector` for these:

```bash
uv run python scripts/run_analysis.py --vector \
--subscription <id> --sql 'SELECT * FROM wdpa_protected_areas LIMIT 5'
```

The full SDK gotchas (auth, `subscription_id` vs `dataset_id`, AOI geometry rules) live in [`tutorials/cecil-data-analysis-xql/references/sdk.md`](../../tutorials/cecil-data-analysis-xql/references/sdk.md).

## Failure modes — say so plainly

- **No subscriptions and no clear dataset match** → say what you looked for, list candidates, stop.
- **SQL fails twice with the same error** → stop and show the user. Often the fix is the `DESCRIBE` you skipped.
- **Question isn't really about Cecil data** → decline. Better than silently producing a confident-sounding irrelevant answer.

For the full worked end-to-end example (dominant land cover 2020 vs 2023 with chart), see the tutorial.
180 changes: 180 additions & 0 deletions tutorials/cecil-data-analysis-xql/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,180 @@
# Cecil data analysis with xarray-sql — tutorial

End-to-end walkthrough for querying Cecil earth-observation data with SQL via [xarray-sql](https://github.com/xarray-contrib/xarray-sql) (`xql`). This is the human-facing setup and worked example; the agent-facing condensed version lives in [`skills/cecil-data-analysis-xql/SKILL.md`](../../skills/cecil-data-analysis-xql/SKILL.md).

```
tutorials/cecil-data-analysis-xql/
├── README.md (this file)
├── references/
│ ├── datasets.md catalog pointers — read first when picking a dataset
│ ├── sdk.md condensed Cecil SDK reference
│ └── xarray_sql.md XarrayContext patterns and SQL gotchas
└── scripts/
├── _env.py auto-loads .env so you don't have to
├── list_subscriptions.py what's already paid for
├── inspect_dataset.py variables, units, reference_table, constraints
└── run_analysis.py load → register → run SQL → save outputs
```

## Step 0 — Prerequisites

1. **Install Python dependencies** with `uv`:

```bash
uv add cecil 'xarray-sql>=0.0.1,<0.1' matplotlib pandas tabulate
```

`xarray-sql` is pre-1.0; pin to a sub-`0.1` range so `XarrayContext().from_dataset()` keeps working when the next minor lands.

2. **Create a `.env` file** in the working directory with your API key:

```
CECIL_API_KEY=your-key-here
```

The scripts auto-load `.env` via `scripts/_env.py` — no manual `source` needed.

3. **Verify the key works**. Open a shell with `.env` loaded and run:

```bash
uv run python -c "import cecil; cecil.Client().list_datasets(); print('authenticated')"
```

If this returns 401, the key is invalid or expired. Fix and re-run before proceeding.

## Step 1 — Pick a dataset

1. Call the live catalog from the SDK:

```python
import cecil
for d in cecil.Client().list_datasets():
print(d.id, d.name, d.categories,
d.spatial_resolution.nominal, d.temporal_coverage.nominal)
```

See [`references/datasets.md`](references/datasets.md) for selection guidance — which dataset answers which question, plus the cross-cutting gotchas.

2. **See what's already paid for.** Strongly prefer existing subscriptions; they're free at the margin and the data is already provisioned:

```bash
uv run python scripts/list_subscriptions.py
```

3. If two datasets are genuinely plausible for the question (e.g. Chloris vs Sylvera for biomass), present 2–3 candidates with one-line tradeoffs and let the user pick.

4. Once chosen, see its variables, units, and any reference tables you'll need to join:

```bash
uv run python scripts/inspect_dataset.py <dataset_id>
```

## Step 2 — Subscription gate

**Default: read-only on existing subscriptions.** If `list_subscriptions.py` already shows the dataset, skip to Step 3.

If no matching subscription exists, **stop and confirm with the user before creating one** — subscriptions are billed. Print the exact calls you'd make (`client.create_aoi(geometry=...)` then `client.create_subscription(aoi_id=..., dataset_id=...)`) and wait for explicit confirmation.

## Step 3 — Load and register

The canonical pattern (wrapped by `scripts/run_analysis.py`):

```python
import cecil
import pandas as pd
import xarray_sql as xql

client = cecil.Client()
ds = client.load_xarray(subscription_id) # raster
# ds = client.load_dataframe(subscription_id) # vector (IBAT)

table_name = ... # derived from ds.attrs["dataset_name"]
ctx = xql.XarrayContext().from_dataset(table_name, ds)

dataset = client.get_dataset(ds.attrs["dataset_id"])
for var in dataset.variables:
if var.reference_table:
ctx.from_pandas(pd.DataFrame(var.reference_table), f"ref_{var.name}")
```

**Critical gotcha:** `load_xarray` and `load_dataframe` take a `subscription_id`, **not** a `dataset_id`. If you get a 404, you almost certainly passed a dataset id. See [`references/sdk.md`](references/sdk.md) for the other SDK gotchas.

**Reference tables** turn integer class codes into human names. Without them, a result of "class 7 has 14921 pixels" means nothing; with them it becomes "Built area covers 14921 pixels".

## Step 4 — Write the SQL

1. **Always `DESCRIBE` first** to confirm column names. Saves a round-trip vs hallucinating them:

```sql
DESCRIBE land_cover_9_class;
```

2. **Quote PascalCase columns** (`"Index"`, `"Name"`) from reference tables — datafusion lower-cases unquoted identifiers and you'll get *"Schema error: no field named index"*.

3. **Narrow exploratory queries** with `WHERE time = ...` or `LIMIT` — the xarray Dataset is dask-backed and `SELECT *` on a multi-year sub-meter raster pulls a lot of data.

4. **Idioms** (`ROW_NUMBER`, `LAG`, time-series aggregation) are in [`references/xarray_sql.md`](references/xarray_sql.md).

## Step 5 — Run, present, save

```bash
uv run python scripts/run_analysis.py \
--subscription <subscription_id> \
--sql 'SELECT DISTINCT land_cover_class FROM land_cover_9_class'
```

For vector datasets (IBAT — protected areas, KBA, species), add `--vector`:

```bash
uv run python scripts/run_analysis.py --vector \
--subscription <subscription_id> \
--sql 'SELECT * FROM wdpa_protected_areas LIMIT 5'
```

Outputs land in the current directory (or `--output-dir`):

- `result.csv` — raw rows
- `result.md` — markdown table + the SQL that was run
- `result.png` — chart, when the result has a plottable shape

Plus a markdown summary on stdout so the answer is visible immediately.

**Present the output using the golden rule** (the skill says the same): SQL → table (≤10 rows) → 2-3 sentence interpretation, in one compact block. No filler, no preamble.

## Worked example end to end

User: *"For our project AOI, what's the dominant land cover class in 2023 and how does it compare to 2020?"*

1. `list_subscriptions.py` shows a sub for "Impact Observatory — Land Cover 9-Class". Use it. Table name → `land_cover_9_class`.
2. `inspect_dataset.py <id>` confirms the variable is `land_cover_class` with a 9-row `reference_table`.
3. Write the SQL, run it via `run_analysis.py`, and present:

```sql
WITH yearly AS (
SELECT
date_part('year', time) AS year,
land_cover_class,
COUNT(*) AS px,
ROW_NUMBER() OVER (
PARTITION BY date_part('year', time)
ORDER BY COUNT(*) DESC
) AS rn
FROM land_cover_9_class
WHERE date_part('year', time) IN (2020, 2023)
GROUP BY 1, 2
)
SELECT y.year, ref."Name" AS dominant_class, y.px
FROM yearly y
JOIN ref_land_cover_class ref ON y.land_cover_class = ref."Index"
WHERE y.rn = 1
ORDER BY y.year;
```

| year | dominant_class | px |
|------|----------------|------:|
| 2020 | Forest | 52014 |
| 2023 | Forest | 48831 |

Forest dominated the AOI in both years (52k pixels in 2020, 49k in 2023). The 6% drop is offset by an increase in built area, suggesting modest expansion at the edges. Chart saved to `result.png`.

That's the loop. Read references when you need them, run scripts instead of re-implementing them, gate paid actions on explicit confirmation, and present query → data → insight as one block.
78 changes: 78 additions & 0 deletions tutorials/cecil-data-analysis-xql/references/datasets.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
# Picking a Cecil dataset

For the live catalog of every published dataset, **call the SDK** — don't
trust a static list (this file used to be one and went stale within weeks):

```python
import cecil
for d in cecil.Client().list_datasets():
print(d.id, d.name, d.categories,
d.spatial_resolution.nominal, d.temporal_coverage.nominal)
```

Every `Dataset` carries `id, name, type, categories, providers, variables,
description, usage_notes, constraints, pricing, spatial_resolution,
temporal_coverage, …`. This file only adds the editorial layer the SDK
*can't* give you — which dataset to pick for which question — plus the
cross-cutting gotchas that affect SQL.

## Selection guidance

**Plant biomass / forest carbon** — Chloris (project-scale: 10 m from 2017+,
30 m from 2000+), Kanop (project monitoring; the Screening tier is the
cheap pre-flight pass), Planet Forest Carbon (Diligence 30 m, Monitoring
3.5 m quarterly), Sylvera (long-term atlas), UMD Hansen (industry standard
for tree loss / gain).

**Land use & land cover** — Impact Observatory 9-/15-/17-Class
(simplified → detailed → very-high-res; pick by class count + pixel size),
JRC Global Forest Cover / Type (2020 snapshots), USDA Cropland Data Layer
(US-only; 10 m for recent, 30 m for long history), USGS Annual NLCD
(40 years of US LULC change), WRI SBTN Natural Lands (SBTN-aligned, 2020),
WRI Tropical Tree Cover (tropics-only).

**Soil moisture** — Lobelia Earth 30 m (field-scale), Planet Soil Water
Content 20 m (frequent, very-high-res) or 1 km (regional, history back to
2002).

**Biodiversity** — IBAT family (IUCN Red List, KBA, WDPA, STAR) for
species / KBA / protected-area overlaps, NHM BII 1 km / 10 km for
biodiversity-intactness scoring.

**Ecosystem integrity** — WCS Forest Landscape Integrity Index
(landscape fragmentation, 2019 snapshot).

## Cross-cutting gotchas

- **LULC integer codes.** `land_cover_class` (and similar) are integer
codes that mean nothing without joining the variable's `reference_table`.
See `xarray_sql.md` for the join pattern.
- **IBAT is vector.** Use `client.load_dataframe(...)`, **not**
`load_xarray`. IBAT datasets are pandas DataFrames, not gridded rasters.
- **US-only datasets.** USDA Cropland Data Layer and USGS NLCD don't cover
non-US AOIs. WRI Tropical Tree Cover is tropics-only.
- **BII history stops in 2021.** If the question is about 2022+, BII
won't help.

## Quick "what dataset answers what question" map

| User asks about... | Try... |
|---|---|
| Forest carbon stock or change | Chloris, Kanop, Planet Forest Carbon, Sylvera |
| Tree-cover loss / deforestation | UMD Hansen, JRC Forest Cover |
| Land cover class history | Impact Observatory (global) or USGS NLCD (US-only) |
| Crop type (US) | USDA Cropland Data Layer |
| Tropical forest extent | WRI Tropical Tree Cover, JRC Forest Cover |
| Soil moisture / drought | Lobelia Earth, Planet Soil Water Content |
| Threatened species overlap | IBAT IUCN Red List (vector — `load_dataframe`) |
| Protected area overlap | IBAT WDPA (vector) |
| Key Biodiversity Areas | IBAT KBA (vector) |
| Biodiversity intactness | NHM BII 1km / 10km |
| Forest landscape integrity | WCS Forest Landscape Integrity Index |

## Out of scope for this skill

Cecil does not cover weather forecasts, atmospheric chemistry / air
quality, GHG concentrations, or ocean / marine datasets. If a user asks
for one of these, tell them plainly that Cecil doesn't carry that data —
don't force-fit a dataset that doesn't really answer the question.
Loading
Loading