cecilearth · alexlogs · May 18, 2026 · May 12, 2026
diff --git a/skills/cecil-data-analysis-xql/SKILL.md b/skills/cecil-data-analysis-xql/SKILL.md
@@ -0,0 +1,123 @@
+---
+name: cecil-data-analysis-xql
+description: Use this skill when the user asks an analysis question about Cecil earth-observation data — forest carbon, biomass, deforestation, land cover, soil moisture, biodiversity, protected areas, ecosystem integrity. The skill writes SQL with xarray-sql (xql) against the user's existing subscription and presents query → result table → plain-English interpretation as one compact block.
+license: MIT
+---
+
+# Cecil data analysis with xarray-sql
+
+Use this skill for questions like *"how has tree cover changed in my AOI?"* or *"what's the dominant land cover here?"* — anything that wants to **query** an existing Cecil subscription with SQL.
+
+For setup, installation, and the runnable scripts that drive these queries, see [`tutorials/cecil-data-analysis-xql/`](../../tutorials/cecil-data-analysis-xql/). This SKILL.md covers only what an agent needs to know in-context.
+
+## Related skills
+
+- [`subscribe-and-load`](../subscribe-and-load/SKILL.md) — sets up the subscription + AOI that the queries below assume.
+- [`land-cover-baseline-and-change`](../land-cover-baseline-and-change/SKILL.md) — the hand-written numpy/xarray counterpart for readers who want to skip the SQL layer.
+
+## When this skill is in scope
+
+**Yes:** dominant-class, time-series, change-detection, area, or overlap questions on Cecil rasters (biomass, LULC, soil moisture, BII, FLII) or vector datasets (IBAT family — protected areas, KBA, species ranges).
+
+**No:** weather forecasts, atmospheric chemistry / GHGs, ocean / marine data, generic Python help, anything without a spatial AOI. Cecil doesn't carry that data — say so plainly and stop. Don't force-fit a dataset.
+
+## Subscription gate (the rule that protects the wallet)
+
+**Default: read-only on existing subscriptions.** Run `scripts/list_subscriptions.py` first. If a subscription for the chosen dataset already exists, go straight to the SQL.
+
+If no matching subscription exists:
+
+1. Summarise what you'd subscribe to: dataset name, provider, coverage, relevant variables.
+2. Show the exact calls you'd make: `client.create_aoi(geometry=...)` (if no AOI) and `client.create_subscription(aoi_id=..., dataset_id=...)`.
+3. **Stop and ask the user to confirm** before calling either method. Subscriptions are billed. Get explicit confirmation for *this* dataset *now*, even if the user agreed to a different one earlier in the conversation.
+
+If the user declines, say so and stop. Don't try to find a free substitute that might silently answer a different question.
+
+## Golden rule: query → table → insight
+
+Whenever this skill is in scope, present every analysis as one compact block, in this order:
+
+1. **SQL** — fenced `sql` block. Applies at every stage: `DESCRIBE`, exploratory queries, final analysis.
+2. **Result table** — Markdown, **capped at 10 rows**. If longer, end with *"… N more in result.csv"*. Never dump 200 rows inline.
+3. **Interpretation** — two or three sentences of plain-English insight. No header, no "here are the results:" preamble.
+
+For exploratory queries with nothing to interpret (`DESCRIBE`, etc.), just show 1 + 2.
+
+## Table naming
+
+The xarray Dataset is registered under a name derived from the dataset (provider prefix stripped, slugified):
+
+| Dataset name | Table name |
+|---|---|
+| Impact Observatory — Land Cover 9-Class | `land_cover_9_class` |
+| Chloris — Aboveground Biomass Stock and Change 10m | `aboveground_biomass_stock_and_change_10m` |
+| Lobelia Earth — Surface Soil Moisture 30m | `surface_soil_moisture_30m` |
+
+Use the same name in your SQL (`FROM land_cover_9_class`, not `FROM cecil`). `scripts/run_analysis.py` does the derivation automatically and prints the table name.
+
+## SQL idioms
+
+**Always start with `DESCRIBE`** to confirm column names before writing real SQL:
+
+```sql
+DESCRIBE land_cover_9_class;
+```
+
+**Top class per timestamp** — `ROW_NUMBER OVER`:
+
+```sql
+SELECT time, ref."Name", cnt
+FROM (
+  SELECT time, land_cover_class, COUNT(*) AS cnt,
+    ROW_NUMBER() OVER (PARTITION BY time ORDER BY COUNT(*) DESC) AS rn
+  FROM land_cover_9_class
+  GROUP BY time, land_cover_class
+) r
+JOIN ref_land_cover_class ref ON r.land_cover_class = ref."Index"
+WHERE rn = 1
+ORDER BY time;
+```
+
+**Change detection** — `LAG OVER` per pixel:
+
+```sql
+SELECT x, y, time,
+  land_cover_class AS new_class,
+  LAG(land_cover_class) OVER (PARTITION BY x, y ORDER BY time) AS prev_class
+FROM land_cover_9_class;
+```
+
+Filter with `WHERE prev_class IS DISTINCT FROM new_class AND prev_class IS NOT NULL` to keep only changed pixels.
+
+**Time-series aggregation** — plain `GROUP BY time`:
+
+```sql
+SELECT time, AVG(aboveground_biomass) AS mean_biomass
+FROM aboveground_biomass_stock_and_change_10m
+WHERE aboveground_biomass IS NOT NULL
+GROUP BY time
+ORDER BY time;
+```
+
+## Quoting rule (datafusion gotcha)
+
+Reference-table columns from the API are **PascalCase** (`"Index"`, `"Name"`). **Double-quote them in SQL** — datafusion lower-cases unquoted identifiers and you'll see *"Schema error: no field named index"*. Get this right and joins work.
+
+## Vector vs raster datasets
+
+Most Cecil datasets are raster (`load_xarray`). The **IBAT family** (IUCN Red List, KBA, WDPA, STAR) is vector — use `load_dataframe`. `run_analysis.py` accepts `--vector` for these:
+
+```bash
+uv run python scripts/run_analysis.py --vector \
+  --subscription <id> --sql 'SELECT * FROM wdpa_protected_areas LIMIT 5'
+```
+
+The full SDK gotchas (auth, `subscription_id` vs `dataset_id`, AOI geometry rules) live in [`tutorials/cecil-data-analysis-xql/references/sdk.md`](../../tutorials/cecil-data-analysis-xql/references/sdk.md).
+
+## Failure modes — say so plainly
+
+- **No subscriptions and no clear dataset match** → say what you looked for, list candidates, stop.
+- **SQL fails twice with the same error** → stop and show the user. Often the fix is the `DESCRIBE` you skipped.
+- **Question isn't really about Cecil data** → decline. Better than silently producing a confident-sounding irrelevant answer.
+
+For the full worked end-to-end example (dominant land cover 2020 vs 2023 with chart), see the tutorial.
diff --git a/tutorials/cecil-data-analysis-xql/README.md b/tutorials/cecil-data-analysis-xql/README.md
@@ -0,0 +1,180 @@
+# Cecil data analysis with xarray-sql — tutorial
+
+End-to-end walkthrough for querying Cecil earth-observation data with SQL via [xarray-sql](https://github.com/xarray-contrib/xarray-sql) (`xql`). This is the human-facing setup and worked example; the agent-facing condensed version lives in [`skills/cecil-data-analysis-xql/SKILL.md`](../../skills/cecil-data-analysis-xql/SKILL.md).
+
+```
+tutorials/cecil-data-analysis-xql/
+├── README.md                       (this file)
+├── references/
+│   ├── datasets.md                 catalog pointers — read first when picking a dataset
+│   ├── sdk.md                      condensed Cecil SDK reference
+│   └── xarray_sql.md               XarrayContext patterns and SQL gotchas
+└── scripts/
+    ├── _env.py                     auto-loads .env so you don't have to
+    ├── list_subscriptions.py       what's already paid for
+    ├── inspect_dataset.py          variables, units, reference_table, constraints
+    └── run_analysis.py             load → register → run SQL → save outputs
+```
+
+## Step 0 — Prerequisites
+
+1. **Install Python dependencies** with `uv`:
+
+   ```bash
+   uv add cecil 'xarray-sql>=0.0.1,<0.1' matplotlib pandas tabulate
+   ```
+
+   `xarray-sql` is pre-1.0; pin to a sub-`0.1` range so `XarrayContext().from_dataset()` keeps working when the next minor lands.
+
+2. **Create a `.env` file** in the working directory with your API key:
+
+   ```
+   CECIL_API_KEY=your-key-here
+   ```
+
+   The scripts auto-load `.env` via `scripts/_env.py` — no manual `source` needed.
+
+3. **Verify the key works**. Open a shell with `.env` loaded and run:
+
+   ```bash
+   uv run python -c "import cecil; cecil.Client().list_datasets(); print('authenticated')"
+   ```
+
+   If this returns 401, the key is invalid or expired. Fix and re-run before proceeding.
+
+## Step 1 — Pick a dataset
+
+1. Call the live catalog from the SDK:
+
+   ```python
+   import cecil
+   for d in cecil.Client().list_datasets():
+       print(d.id, d.name, d.categories,
+             d.spatial_resolution.nominal, d.temporal_coverage.nominal)
+   ```
+
+   See [`references/datasets.md`](references/datasets.md) for selection guidance — which dataset answers which question, plus the cross-cutting gotchas.
+
+2. **See what's already paid for.** Strongly prefer existing subscriptions; they're free at the margin and the data is already provisioned:
+
+   ```bash
+   uv run python scripts/list_subscriptions.py
+   ```
+
+3. If two datasets are genuinely plausible for the question (e.g. Chloris vs Sylvera for biomass), present 2–3 candidates with one-line tradeoffs and let the user pick.
+
+4. Once chosen, see its variables, units, and any reference tables you'll need to join:
+
+   ```bash
+   uv run python scripts/inspect_dataset.py <dataset_id>
+   ```
+
+## Step 2 — Subscription gate
+
+**Default: read-only on existing subscriptions.** If `list_subscriptions.py` already shows the dataset, skip to Step 3.
+
+If no matching subscription exists, **stop and confirm with the user before creating one** — subscriptions are billed. Print the exact calls you'd make (`client.create_aoi(geometry=...)` then `client.create_subscription(aoi_id=..., dataset_id=...)`) and wait for explicit confirmation.
+
+## Step 3 — Load and register
+
+The canonical pattern (wrapped by `scripts/run_analysis.py`):
+
+```python
+import cecil
+import pandas as pd
+import xarray_sql as xql
+
+client = cecil.Client()
+ds = client.load_xarray(subscription_id)              # raster
+# ds = client.load_dataframe(subscription_id)         # vector (IBAT)
+
+table_name = ...  # derived from ds.attrs["dataset_name"]
+ctx = xql.XarrayContext().from_dataset(table_name, ds)
+
+dataset = client.get_dataset(ds.attrs["dataset_id"])
+for var in dataset.variables:
+    if var.reference_table:
+        ctx.from_pandas(pd.DataFrame(var.reference_table), f"ref_{var.name}")
+```
+
+**Critical gotcha:** `load_xarray` and `load_dataframe` take a `subscription_id`, **not** a `dataset_id`. If you get a 404, you almost certainly passed a dataset id. See [`references/sdk.md`](references/sdk.md) for the other SDK gotchas.
+
+**Reference tables** turn integer class codes into human names. Without them, a result of "class 7 has 14921 pixels" means nothing; with them it becomes "Built area covers 14921 pixels".
+
+## Step 4 — Write the SQL
+
+1. **Always `DESCRIBE` first** to confirm column names. Saves a round-trip vs hallucinating them:
+
+   ```sql
+   DESCRIBE land_cover_9_class;
+   ```
+
+2. **Quote PascalCase columns** (`"Index"`, `"Name"`) from reference tables — datafusion lower-cases unquoted identifiers and you'll get *"Schema error: no field named index"*.
+
+3. **Narrow exploratory queries** with `WHERE time = ...` or `LIMIT` — the xarray Dataset is dask-backed and `SELECT *` on a multi-year sub-meter raster pulls a lot of data.
+
+4. **Idioms** (`ROW_NUMBER`, `LAG`, time-series aggregation) are in [`references/xarray_sql.md`](references/xarray_sql.md).
+
+## Step 5 — Run, present, save
+
+```bash
+uv run python scripts/run_analysis.py \
+    --subscription <subscription_id> \
+    --sql 'SELECT DISTINCT land_cover_class FROM land_cover_9_class'
+```
+
+For vector datasets (IBAT — protected areas, KBA, species), add `--vector`:
+
+```bash
+uv run python scripts/run_analysis.py --vector \
+    --subscription <subscription_id> \
+    --sql 'SELECT * FROM wdpa_protected_areas LIMIT 5'
+```
+
+Outputs land in the current directory (or `--output-dir`):
+
+- `result.csv` — raw rows
+- `result.md` — markdown table + the SQL that was run
+- `result.png` — chart, when the result has a plottable shape
+
+Plus a markdown summary on stdout so the answer is visible immediately.
+
+**Present the output using the golden rule** (the skill says the same): SQL → table (≤10 rows) → 2-3 sentence interpretation, in one compact block. No filler, no preamble.
+
+## Worked example end to end
+
+User: *"For our project AOI, what's the dominant land cover class in 2023 and how does it compare to 2020?"*
+
+1. `list_subscriptions.py` shows a sub for "Impact Observatory — Land Cover 9-Class". Use it. Table name → `land_cover_9_class`.
+2. `inspect_dataset.py <id>` confirms the variable is `land_cover_class` with a 9-row `reference_table`.
+3. Write the SQL, run it via `run_analysis.py`, and present:
+
+   ```sql
+   WITH yearly AS (
+     SELECT
+       date_part('year', time) AS year,
+       land_cover_class,
+       COUNT(*) AS px,
+       ROW_NUMBER() OVER (
+         PARTITION BY date_part('year', time)
+         ORDER BY COUNT(*) DESC
+       ) AS rn
+     FROM land_cover_9_class
+     WHERE date_part('year', time) IN (2020, 2023)
+     GROUP BY 1, 2
+   )
+   SELECT y.year, ref."Name" AS dominant_class, y.px
+   FROM yearly y
+   JOIN ref_land_cover_class ref ON y.land_cover_class = ref."Index"
+   WHERE y.rn = 1
+   ORDER BY y.year;
+   ```
+
+   | year | dominant_class |    px |
+   |------|----------------|------:|
+   | 2020 | Forest         | 52014 |
+   | 2023 | Forest         | 48831 |
+
+   Forest dominated the AOI in both years (52k pixels in 2020, 49k in 2023). The 6% drop is offset by an increase in built area, suggesting modest expansion at the edges. Chart saved to `result.png`.
+
+That's the loop. Read references when you need them, run scripts instead of re-implementing them, gate paid actions on explicit confirmation, and present query → data → insight as one block.
diff --git a/tutorials/cecil-data-analysis-xql/references/datasets.md b/tutorials/cecil-data-analysis-xql/references/datasets.md
@@ -0,0 +1,78 @@
+# Picking a Cecil dataset
+
+For the live catalog of every published dataset, **call the SDK** — don't
+trust a static list (this file used to be one and went stale within weeks):
+
+```python
+import cecil
+for d in cecil.Client().list_datasets():
+    print(d.id, d.name, d.categories,
+          d.spatial_resolution.nominal, d.temporal_coverage.nominal)
+```
+
+Every `Dataset` carries `id, name, type, categories, providers, variables,
+description, usage_notes, constraints, pricing, spatial_resolution,
+temporal_coverage, …`. This file only adds the editorial layer the SDK
+*can't* give you — which dataset to pick for which question — plus the
+cross-cutting gotchas that affect SQL.
+
+## Selection guidance
+
+**Plant biomass / forest carbon** — Chloris (project-scale: 10 m from 2017+,
+30 m from 2000+), Kanop (project monitoring; the Screening tier is the
+cheap pre-flight pass), Planet Forest Carbon (Diligence 30 m, Monitoring
+3.5 m quarterly), Sylvera (long-term atlas), UMD Hansen (industry standard
+for tree loss / gain).
+
+**Land use & land cover** — Impact Observatory 9-/15-/17-Class
+(simplified → detailed → very-high-res; pick by class count + pixel size),
+JRC Global Forest Cover / Type (2020 snapshots), USDA Cropland Data Layer
+(US-only; 10 m for recent, 30 m for long history), USGS Annual NLCD
+(40 years of US LULC change), WRI SBTN Natural Lands (SBTN-aligned, 2020),
+WRI Tropical Tree Cover (tropics-only).
+
+**Soil moisture** — Lobelia Earth 30 m (field-scale), Planet Soil Water
+Content 20 m (frequent, very-high-res) or 1 km (regional, history back to
+2002).
+
+**Biodiversity** — IBAT family (IUCN Red List, KBA, WDPA, STAR) for
+species / KBA / protected-area overlaps, NHM BII 1 km / 10 km for
+biodiversity-intactness scoring.
+
+**Ecosystem integrity** — WCS Forest Landscape Integrity Index
+(landscape fragmentation, 2019 snapshot).
+
+## Cross-cutting gotchas
+
+- **LULC integer codes.** `land_cover_class` (and similar) are integer
+  codes that mean nothing without joining the variable's `reference_table`.
+  See `xarray_sql.md` for the join pattern.
+- **IBAT is vector.** Use `client.load_dataframe(...)`, **not**
+  `load_xarray`. IBAT datasets are pandas DataFrames, not gridded rasters.
+- **US-only datasets.** USDA Cropland Data Layer and USGS NLCD don't cover
+  non-US AOIs. WRI Tropical Tree Cover is tropics-only.
+- **BII history stops in 2021.** If the question is about 2022+, BII
+  won't help.
+
+## Quick "what dataset answers what question" map
+
+| User asks about... | Try... |
+|---|---|
+| Forest carbon stock or change | Chloris, Kanop, Planet Forest Carbon, Sylvera |
+| Tree-cover loss / deforestation | UMD Hansen, JRC Forest Cover |
+| Land cover class history | Impact Observatory (global) or USGS NLCD (US-only) |
+| Crop type (US) | USDA Cropland Data Layer |
+| Tropical forest extent | WRI Tropical Tree Cover, JRC Forest Cover |
+| Soil moisture / drought | Lobelia Earth, Planet Soil Water Content |
+| Threatened species overlap | IBAT IUCN Red List (vector — `load_dataframe`) |
+| Protected area overlap | IBAT WDPA (vector) |
+| Key Biodiversity Areas | IBAT KBA (vector) |
+| Biodiversity intactness | NHM BII 1km / 10km |
+| Forest landscape integrity | WCS Forest Landscape Integrity Index |
+
+## Out of scope for this skill
+
+Cecil does not cover weather forecasts, atmospheric chemistry / air
+quality, GHG concentrations, or ocean / marine datasets. If a user asks
+for one of these, tell them plainly that Cecil doesn't carry that data —
+don't force-fit a dataset that doesn't really answer the question.