cybergis · amrit110 · Jun 9, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -10,6 +10,8 @@ The format is based on Keep a Changelog, and the project follows Semantic Versio
 
 ### Added
 
+- **OlmoEarth v1/v1.1 embedder (`olmoearth`).** Adds support for the [OlmoEarth](https://huggingface.co/collections/allenai/olmoearth) foundation model family from Allen AI, trained on the Major TOM dataset. All 7 released variants are supported: `nano`, `tiny`, `base`, `large` (v1) and `nano_v1_1`, `tiny_v1_1`, `base_v1_1` (v1.1), with embedding dimensions 128/192/768/1024. The adapter fetches all 12 Sentinel-2 L2A bands from GEE in OlmoEarth's native band-set order, applies per-band mean±2σ normalization (OlmoEarth COMPUTED strategy), and encodes with the FlexiViT encoder. Both `pooled` and `grid` output modes are supported. `patch_size` (default 4) and `image_size` (default 256) are configurable via `model_config` or environment variables. Requires the `olmoearth-pretrain-minimal` package: `pip install rs-embed[olmoearth]`.
+
 - **GEE fetch statistics reporting in `export_batch`.** When `show_progress=True`, a `[gee_fetch]` summary line is now printed to stderr after each prefetch chunk completes, reporting total planned fetches, completed, failed, cache hits, and the most recently processed point/sensor. This gives users visibility into GEE quota consumption, cache reuse, and whether runtime is dominated by fetching vs. inference. No output is emitted when `show_progress=False` or when no GEE provider is involved (e.g. precomputed models). The underlying `FetchStats` class in `tools/progress.py` is thread-safe and accumulates counts cumulatively across chunks.
 
 ### Fixed

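The changelog entry above mentions per-band mean±2σ normalization. As a rough illustration only, the following NumPy sketch clips each band to `mean ± 2σ` and rescales it linearly to `[0, 1]`. The function name `normalize_mean_2sigma` and the statistics used are hypothetical placeholders, not values or code from `olmoearth-pretrain-minimal`, and the exact rescaling formula is an assumption.

```python
import numpy as np


def normalize_mean_2sigma(chw, mean, std):
    """Clip each band to mean ± 2σ, then rescale linearly to [0, 1].

    chw:  array of shape (C, H, W)
    mean: per-band means, length C (hypothetical values in this sketch)
    std:  per-band standard deviations, length C
    """
    chw = np.asarray(chw, dtype=np.float32)
    mean = np.asarray(mean, dtype=np.float32)[:, None, None]
    std = np.asarray(std, dtype=np.float32)[:, None, None]
    lo = mean - 2.0 * std
    hi = mean + 2.0 * std
    # Out-of-range values saturate at the bounds instead of being dropped.
    return (np.clip(chw, lo, hi) - lo) / (hi - lo)


# Two bands with made-up statistics, one row of three pixels each.
demo_mean = [1000.0, 2000.0]
demo_std = [250.0, 500.0]
demo_input = np.array(
    [[[0.0, 1000.0, 5000.0]],
     [[1000.0, 2000.0, 3000.0]]]
)
demo_output = normalize_mean_2sigma(demo_input, demo_mean, demo_std)
```

In this sketch a pixel equal to the band mean maps to `0.5`, and anything beyond two standard deviations maps to `0` or `1`.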
diff --git a/README.md b/README.md
@@ -125,6 +125,7 @@ This is a convenience index with basic model info only (for quick scanning / lin
 | `terrafm`         | S2 12-band / S1 VV-VH    | 10m                 | [ICLR 2026](https://arxiv.org/abs/2506.06281)                                   | [link](https://github.com/mbzuai-oryx/TerraFM)               |
 | `thor`            | S2 10-band               | 10m                 | [arXiv 2026](https://arxiv.org/abs/2601.16011)                                  | [link](https://github.com/FM4CS/THOR)                        |
 | `agrifm`          | S2 time series (10-band) | 10m                 | [RSE 2026](https://www.sciencedirect.com/science/article/pii/S0034425726000040) | [link](https://github.com/flyakon/AgriFM)                    |
+| `olmoearth`       | S2 L2A 12-band           | 10m                 | [arXiv 2025](https://arxiv.org/abs/2511.13655)                                  | [link](https://huggingface.co/collections/allenai/olmoearth) |
 
 Resolution here means the default provider/source fetch resolution used by the adapter, not the final resized tensor shape seen by the model.
 

diff --git a/docs/models.md b/docs/models.md
@@ -57,6 +57,7 @@ Some detail-page filenames still use older names for compatibility, but the cano
 | `anysat`          | S2 10-band time series          | 768  | 10m                | multi-frame      | JEPA; `s2_dates` DOY side input                         | [detail](models/anysat.md)     |
 | `galileo`         | S2 10-band time series          | 128  | 10m                | multi-frame      | nano default; month tokens                              | [detail](models/galileo.md)    |
 | `agrifm`          | S2 10-band time series          | 1024 | 10m                | multi-frame      | Video Swin; fixed `T` frame stack                       | [detail](models/agrifm.md)     |
+| `olmoearth`       | S2 L2A 12-band                  | 128–1024 | 10m            | single composite | FlexiViT; 4 sizes (nano/tiny/base/large); requires `[olmoearth]` extra | [detail](models/olmoearth.md) |
 
 ---
 

diff --git a/docs/models/olmoearth.md b/docs/models/olmoearth.md
@@ -0,0 +1,213 @@
+# OlmoEarth (`olmoearth`)
+
+## Quick Facts
+
+| Field                | Value                                                                                                     |
+| -------------------- | --------------------------------------------------------------------------------------------------------- |
+| Model ID             | `olmoearth`                                                                                               |
+| Family / Backbone    | OlmoEarth v1/v1.1 — FlexiViT encoder (ViT-style) trained on the Major TOM dataset                       |
+| Adapter type         | `on-the-fly`                                                                                              |
+| Model config keys    | `variant` (default: `nano`), `patch_size` (default: `4`), `image_size` (default: `256`)                  |
+| Training alignment   | High (S2 L2A 12-band; native 10 m resolution; per-band mean±2σ normalization matches training pipeline)   |
+
+!!! success "OlmoEarth In 30 Seconds"
+    OlmoEarth is a **multi-modal geospatial foundation model** from Allen AI, trained on the Major TOM dataset with Sentinel-2 L2A as the primary modality. It uses a FlexiViT encoder that accepts variable patch sizes, enabling flexible spatial resolution trade-offs. In `rs-embed`, the adapter fetches all **12 S2 L2A bands** and encodes them in a single forward pass.
+
+    Key characteristics:
+    - All 12 S2 L2A bands in the OlmoEarth band-set order (10 m → 20 m → 60 m groups)
+    - Per-band normalization using OlmoEarth's COMPUTED strategy (mean ± 2σ)
+    - 4 size variants in v1 (`nano`/`tiny`/`base`/`large`) and 3 in v1.1 (`nano_v1_1`/`tiny_v1_1`/`base_v1_1`)
+    - `patch_size` controls the spatial token density (1–8); default `4` matches the official inference example
+    - Input image resized to `image_size` (default 256) before encoding
+    - Requires `olmoearth-pretrain-minimal` (`pip install rs-embed[olmoearth]`)
+
+---
+
+## Input Contract
+
+| Field                 | Value                                                                              |
+| --------------------- | ---------------------------------------------------------------------------------- |
+| Backend               | provider only (`gee` / `auto`)                                                     |
+| `TemporalSpec`        | `range` or `year` (normalized via shared helper; year → full year composite)       |
+| Default collection    | `COPERNICUS/S2_SR_HARMONIZED`                                                      |
+| Default bands (order) | `B2, B3, B4, B8, B5, B6, B7, B8A, B11, B12, B1, B9`                              |
+| Default fetch         | `scale_m=10`, `cloudy_pct=30`, `composite="median"`                                |
+| `input_chw`           | `CHW`, `C=12` in the band order above, raw SR DN `0..10000`                        |
+| Side inputs           | timestamps (derived from temporal midpoint), none required from user                |
+
+The band order matches OlmoEarth's internal `Modality.SENTINEL2_L2A` definition:
+three band sets (10 m, 20 m, 60 m) totaling 12 channels.
+
+---
+
+## Preprocessing Pipeline
+
+```mermaid
+flowchart LR
+    INPUT["S2 12-band raw DN"] --> NORM["Per-band mean±2σ\nnormalization"]
+    NORM --> RESIZE["Resize to image_size\n(default 256×256)"]
+    RESIZE --> SAMPLE["Build MaskedOlmoEarthSample\n(B=1, H, W, T=1, C=12)"]
+    SAMPLE --> ENC["FlexiViT encoder\npatch_size=4 (default)"]
+    ENC --> POOL["Pool over T×BandSets\n→ (B, H', W', D)"]
+    POOL --> OUTPUT{Output mode}
+    OUTPUT -- pooled --> VEC["Global mean/max\n→ (D,) vector"]
+    OUTPUT -- grid --> GRID["Spatial token map\n(D, H', W')"]
+```
+
+---
+
+## Architecture Concept
+
+```mermaid
+flowchart LR
+    S2["S2 L2A\n12 bands\n3 band sets"] --> PE["FlexiViT\npatch embed\n(patch_size 1–8)"]
+    TS["Timestamps\n(day, month, year)"] --> TE["Temporal + month\nembeddings"]
+    PE --> ATTN["Transformer\nencoder\n(depth by variant)"]
+    TE --> ATTN
+    ATTN --> OUT["tokens:\n(B, H', W', T, S, D)"]
+    OUT --> MEAN["Mean over T, S"]
+    MEAN --> RESULT["Spatial grid\n(D, H', W')"]
+```
+
+The encoder output is a 6-D tensor `(B, H', W', T=1, S, D)` where `S` is the number of band sets (3 for v1, 1 for v1.1 due to the linear patch embedding change). All pooling is applied after the encoder.
+
+---
+
+## Model-specific Settings
+
+### `variant`
+
+Selects the model size and version. Weights are automatically downloaded from Hugging Face on first use.
+
+| Variant      | Version | Encoder Dim | Depth | HuggingFace Repo                   |
+| ------------ | ------- | ----------- | ----- | ---------------------------------- |
+| `nano`       | v1      | 128         | 4     | `allenai/OlmoEarth-v1-Nano`        |
+| `tiny`       | v1      | 192         | 12    | `allenai/OlmoEarth-v1-Tiny`        |
+| `base`       | v1      | 768         | 12    | `allenai/OlmoEarth-v1-Base`        |
+| `large`      | v1      | 1024        | 24    | `allenai/OlmoEarth-v1-Large`       |
+| `nano_v1_1`  | v1.1    | 128         | 4     | `allenai/OlmoEarth-v1_1-Nano`      |
+| `tiny_v1_1`  | v1.1    | 192         | 12    | `allenai/OlmoEarth-v1_1-Tiny`      |
+| `base_v1_1`  | v1.1    | 768         | 12    | `allenai/OlmoEarth-v1_1-Base`      |
+
+!!! note "v1 vs v1.1 architecture difference"
+    v1 uses a Conv2D-based patch embedding, producing 3 separate band-set token groups per spatial location.
+    v1.1 uses a linear patch embedding (`use_linear_patch_embed=True`) that merges band sets into a single token stream. Both versions produce the same output dimensionality after pooling.
+
+Short aliases are accepted: `nano_11`, `tiny_11`, `base_11` for v1.1 variants; `nano_v1`, `tiny_v1`, `base_v1`, `large_v1` for v1 variants.
+
+### `patch_size`
+
+Controls the spatial patch size for the FlexiViT encoder. Smaller values produce more spatial tokens (higher resolution) at the cost of longer inference time.
+
+| `patch_size` | Tokens (256×256 image) | Note                                |
+| ------------ | ---------------------- | ----------------------------------- |
+| `4`          | 64 × 64 = 4096         | Default; more spatially detailed    |
+| `8`          | 32 × 32 = 1024         | Faster; coarser spatial grid        |
+| `2`          | 128 × 128 = 16384      | Very detailed; significantly slower |
+
+### `image_size`
+
+Target pixel size for the resize step. The fetched patch is always resized to `(image_size, image_size)` before encoding. `image_size` must be divisible by `patch_size`.
+
+Default: `256` (matching the OlmoEarth training tile size).
+
+---
+
+## Output Semantics
+
+### Pooled (`OutputSpec.pooled()`)
+
+The encoder output `(B, H', W', T=1, S, D)` is pooled over all spatial, temporal, and band-set dimensions via the OlmoEarth built-in `pool_unmasked_tokens()`. This produces a `(D,)` vector.
+
+`pooling="mean"` (default) averages over token positions; `pooling="max"` takes the element-wise maximum over token positions.
+
+### Grid (`OutputSpec.grid()`)
+
+Returns a `(D, H', W')` spatial token map as an `xarray.DataArray` with dimensions `(d, y, x)`. The temporal (T=1) and band-set (S) dimensions are averaged out; only the spatial token grid is retained.
+
+Grid size depends on `image_size` and `patch_size`:
+```
+H' = W' = image_size // patch_size
+```
+With the defaults (`image_size=256`, `patch_size=4`), the grid is `64 × 64`.
+
+---
+
+## Environment Variables
+
+| Variable                         | Default  | Effect                                              |
+| -------------------------------- | -------- | --------------------------------------------------- |
+| `RS_EMBED_OLMOEARTH_VARIANT`     | `nano`   | Default model variant when `model_config` not given |
+| `RS_EMBED_OLMOEARTH_PATCH_SIZE`  | `4`      | Default patch size when `model_config` not given    |
+| `RS_EMBED_OLMOEARTH_IMAGE_SIZE`  | `256`    | Default image resize target                         |
+| `RS_EMBED_OLMOEARTH_FETCH_WORKERS` | `8`    | Parallel GEE fetch workers for batch calls          |
+| `RS_EMBED_OLMOEARTH_BATCH_SIZE`  | `4` (CPU) / `16` (CUDA) | Inference batch size for `get_embeddings_batch_from_inputs` |
+
+---
+
+## Installation
+
+OlmoEarth requires an additional package not included in the base `rs-embed` install:
+
+```bash
+pip install "rs-embed[olmoearth]"
+# or add the extra dependency directly to an existing rs-embed environment
+uv pip install olmoearth-pretrain-minimal
+```
+
+---
+
+## Usage Examples
+
+```python
+import rs_embed as rs
+from rs_embed.core.specs import BBox, TemporalSpec, OutputSpec
+
+# Pooled embedding with default nano variant
+emb = rs.get_embedding(
+    "olmoearth",
+    spatial=BBox(minlon=-2.0, minlat=6.0, maxlon=-1.9, maxlat=6.1),
+    temporal=TemporalSpec.year(2022),
+    output=OutputSpec.pooled(),
+)
+print(emb.data.shape)   # (128,) for nano
+
+# Use base variant
+emb_base = rs.get_embedding(
+    "olmoearth",
+    spatial=BBox(minlon=-2.0, minlat=6.0, maxlon=-1.9, maxlat=6.1),
+    temporal=TemporalSpec.year(2022),
+    output=OutputSpec.pooled(),
+    model_config={"variant": "base"},
+)
+print(emb_base.data.shape)   # (768,) for base
+
+# Grid embedding (spatial token map)
+emb_grid = rs.get_embedding(
+    "olmoearth",
+    spatial=BBox(minlon=-2.0, minlat=6.0, maxlon=-1.9, maxlat=6.1),
+    temporal=TemporalSpec.year(2022),
+    output=OutputSpec.grid(),
+    model_config={"variant": "nano", "patch_size": 8},
+)
+print(emb_grid.data.shape)   # (128, 32, 32) for nano with patch_size=8
+
+# Class-based API for repeated calls
+from rs_embed.model import Model
+from rs_embed.core.specs import PointBuffer
+
+model = Model("olmoearth", model_config={"variant": "tiny"})
+embeddings = model.get_embeddings_batch([
+    PointBuffer(lon=-1.95, lat=6.05, buffer_m=1000),
+    PointBuffer(lon=-2.10, lat=6.20, buffer_m=1000),
+], temporal=TemporalSpec.year(2022))
+```
+
+---
+
+## Notes and Caveats
+
+- The OlmoEarth normalizer clips each band to `mean ± 2σ` and then rescales it to `[0, 1]`. Out-of-range values saturate at the bounds; they are not discarded.
+- `patch_size` is a **model input** (FlexiViT accepts variable patch sizes), not a preprocessing hyperparameter. Different `patch_size` values may produce embeddings with different spatial characteristics.
+- The `large` variant is only available in v1 (no v1.1 large release at time of writing).
+- Weights are cached by `huggingface_hub` in the default HF cache directory.
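The shape bookkeeping described under Output Semantics in `docs/models/olmoearth.md` can be sketched with plain NumPy. This is an illustration of the documented shapes only: it uses random numbers in place of real encoder tokens, ignores masking (which the real `pool_unmasked_tokens()` is expected to handle), and the helper names `tokens_to_grid` and `tokens_to_pooled` are hypothetical.

```python
import numpy as np


def tokens_to_grid(tokens):
    """(B=1, H', W', T, S, D) -> (D, H', W'), averaging over T and S."""
    spatial = tokens.mean(axis=(3, 4))           # (B, H', W', D)
    return np.transpose(spatial[0], (2, 0, 1))   # (D, H', W')


def tokens_to_pooled(tokens, pooling="mean"):
    """(B=1, H', W', T, S, D) -> (D,), pooling over every token position."""
    flat = tokens.reshape(-1, tokens.shape[-1])  # (N, D) for B=1
    if pooling == "mean":
        return flat.mean(axis=0)
    if pooling == "max":
        return flat.max(axis=0)
    raise ValueError(f"unsupported pooling: {pooling!r}")


# Stand-in for a nano v1 encoder output with image_size=256, patch_size=8.
image_size, patch_size, dim, band_sets = 256, 8, 128, 3
side = image_size // patch_size                  # H' = W' = 32
rng = np.random.default_rng(0)
fake_tokens = rng.standard_normal((1, side, side, 1, band_sets, dim))

grid = tokens_to_grid(fake_tokens)
pooled = tokens_to_pooled(fake_tokens)
```

With these stand-in inputs, `grid` has shape `(128, 32, 32)` and `pooled` has shape `(128,)`, matching the shapes quoted in the usage examples.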
diff --git a/docs/models/prithvi.md b/docs/models/prithvi.md
@@ -19,7 +19,7 @@
     In `rs-embed`, its most important characteristics are:
 
     - **required** temporal (`year, day_of_year`) and location (`lat, lon`) side inputs auto-derived by the adapter: see [Input Contract](#input-contract)
-    - 30 m default `sensor.scale_m`, not the more common S2 10 m default — a frequent source of silent drift: see [Reproducibility Notes](#reproducibility-notes)
+    - 30 m default `sensor.scale_m`, not the more common S2 10 m default — a frequent source of silent drift: see [Environment Variables / Tuning Knobs](#environment-variables-tuning-knobs)
     - `resize` vs `pad` preprocessing changes token geometry and should be treated as part of the experiment, not as a cosmetic knob: see [Environment Variables / Tuning Knobs](#environment-variables-tuning-knobs)
 
 ---

diff --git a/docs/models/satmae.md b/docs/models/satmae.md
@@ -17,7 +17,7 @@
     In `rs-embed`, its most important characteristics are:
 
     - RGB-only (`B4,B3,B2`); raw SR is converted to `uint8` before model preprocessing: see [Preprocessing Pipeline](#preprocessing-pipeline)
-    - token path is always used (`mask_ratio=0.0`), and any CLS token is auto-removed before pooling/grid: see [Output Semantics](#output-semantics)
+    - token path is always used (`mask_ratio=0.0`), and any CLS token is auto-removed before pooling/grid: see [Reference](#reference)
     - checkpoint selection via `RS_EMBED_SATMAE_ID` (Hugging Face model ID) — default targets the fMoW large checkpoint: see [Environment Variables / Tuning Knobs](#environment-variables-tuning-knobs)
 
 ---

diff --git a/docs/models_reference.md b/docs/models_reference.md
@@ -25,7 +25,7 @@ Read this section before comparing any model that accepts `TemporalSpec.range(..
 
 For most on-the-fly adapters, `TemporalSpec.range(start, end)` means "filter imagery in `[start, end)` and build one composite patch for model input," usually with `median` and optionally `mosaic` through `SensorSpec.composite`.
 
-The multi-frame adapters `agrifm`, `anysat`, and `galileo` instead split the requested range into sub-windows and composite one frame per bin. Current single-composite adapters include `remoteclip`, `satmae`, `satmaepp`, `satmaepp_s2_10b`, `scalemae`, `wildsat`, `prithvi`, `terrafm`, `terramind`, `dofa`, `fomo`, `thor`, and `satvision`.
+The multi-frame adapters `agrifm`, `anysat`, and `galileo` instead split the requested range into sub-windows and composite one frame per bin. Current single-composite adapters include `remoteclip`, `satmae`, `satmaepp`, `satmaepp_s2_10b`, `scalemae`, `wildsat`, `prithvi`, `terrafm`, `terramind`, `dofa`, `fomo`, `thor`, `satvision`, and `olmoearth`.
 
 ### Multi-frame Semantics
 
@@ -68,6 +68,7 @@ Use this table to avoid unfair comparisons between plain image encoders and adap
 | `thor`            | Yes (`S1`/`S2`)               | Yes (select one modality per call: `s1` or `s2`)          | No                                                            | No hard extra metadata (optional S1 options: orbit, linear/DB path) |
 | `agrifm`          | No (this adapter path)        | No                                                        | No extra side tensor, but temporal stack `[T,C,H,W]` required | Temporal coverage is important (no separate metadata tensor)        |
 | `satvision`       | No (this adapter path)        | No                                                        | No separate side tensor                                       | Yes: strict 14-channel order/calibration schema (band semantics)    |
+| `olmoearth`       | Yes (multi-modal architecture) | S2 L2A only in this adapter                              | Yes (image + mask + timestamps; all derived automatically)    | No hard extra metadata (timestamps derived from temporal midpoint)  |
 
 In practice, the most obviously multi-input models here are `prithvi` (image plus temporal and location coordinates), `anysat` (time series plus `s2_dates`), `galileo` (image-derived tensors plus masks and `months`), `dofa` (image plus wavelengths), and `scalemae` (image plus `input_res_m`).
 
@@ -93,6 +94,7 @@ This table only lists env vars that materially change model input construction o
 | `thor`            | `RS_EMBED_THOR_IMG`, `RS_EMBED_THOR_NORMALIZE`, plus modality and sensor-side options (`s2`/`s1`)                                                                                                                                      |
 | `agrifm`          | `RS_EMBED_AGRIFM_IMG`, `RS_EMBED_AGRIFM_NORM`, `RS_EMBED_AGRIFM_FRAMES`                                                                                                                                                                |
 | `satvision`       | `RS_EMBED_SATVISION_TOA_IMG`, `RS_EMBED_SATVISION_TOA_NORM`, channel-index and calibration env keys                                                                                                                                    |
+| `olmoearth`       | `RS_EMBED_OLMOEARTH_VARIANT`, `RS_EMBED_OLMOEARTH_IMAGE_SIZE`, `RS_EMBED_OLMOEARTH_PATCH_SIZE`                                                                                                                                          |
 
 ### Practical Guidance
 

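The `olmoearth` environment variables listed above act as fallbacks when `model_config` does not set a key. A minimal sketch of that precedence, together with the documented constraints (`patch_size` in 1–8, `image_size` divisible by `patch_size`), might look like the following. The function `resolve_settings` and the dictionaries around it are hypothetical and are not taken from the adapter's source; only the variable names and defaults come from the documentation.

```python
import os

# Defaults and env var names as documented for the olmoearth adapter.
DEFAULTS = {"variant": "nano", "patch_size": 4, "image_size": 256}
ENV_KEYS = {
    "variant": "RS_EMBED_OLMOEARTH_VARIANT",
    "patch_size": "RS_EMBED_OLMOEARTH_PATCH_SIZE",
    "image_size": "RS_EMBED_OLMOEARTH_IMAGE_SIZE",
}


def resolve_settings(model_config=None, env=None):
    """Resolve settings with precedence: model_config > env var > default."""
    env = os.environ if env is None else env
    model_config = model_config or {}
    settings = {}
    for key, default in DEFAULTS.items():
        if key in model_config:
            value = model_config[key]
        else:
            value = env.get(ENV_KEYS[key], default)
        # Env vars arrive as strings, so coerce to the default's type.
        settings[key] = type(default)(value)

    if not 1 <= settings["patch_size"] <= 8:
        raise ValueError("patch_size must be between 1 and 8")
    if settings["image_size"] % settings["patch_size"] != 0:
        raise ValueError("image_size must be divisible by patch_size")

    # Side length of the spatial token grid: H' = W' = image_size // patch_size.
    settings["grid"] = settings["image_size"] // settings["patch_size"]
    return settings


from_env = resolve_settings(env={"RS_EMBED_OLMOEARTH_PATCH_SIZE": "8"})
from_config = resolve_settings(
    {"patch_size": 2}, env={"RS_EMBED_OLMOEARTH_PATCH_SIZE": "8"}
)
```

Passing `env` explicitly keeps the sketch testable without touching the real process environment.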
diff --git a/mkdocs.yml b/mkdocs.yml
@@ -95,6 +95,7 @@ nav:
           - AnySat: models/anysat.md
           - Galileo: models/galileo.md
           - AgriFM: models/agrifm.md
+          - OlmoEarth: models/olmoearth.md
   - API:
       - Overview: api.md
       - Specs & Data Structures: api_specs.md

diff --git a/pyproject.toml b/pyproject.toml
@@ -68,9 +68,13 @@ terramind = [
   # TerraMind still loads its backbone through the TerraTorch registry.
   "terratorch==1.2.1",
 ]
+olmoearth = [
+  "olmoearth-pretrain-minimal>=0.0.5",
+]
 full = [
   "matplotlib>=3.10",
   "terratorch==1.2.1",
+  "olmoearth-pretrain-minimal>=0.0.5",
 ]
 dev = [
   "pytest>=7.4",

diff --git a/src/rs_embed/embedders/catalog.py b/src/rs_embed/embedders/catalog.py
@@ -21,6 +21,7 @@
     "thor": ("onthefly_thor", "THORBaseEmbedder"),
     "agrifm": ("onthefly_agrifm", "AgriFMEmbedder"),
     "satvision": ("onthefly_satvision_toa", "SatVisionTOAEmbedder"),
+    "olmoearth": ("onthefly_olmoearth", "OlmoEarthEmbedder"),
 }
 
 MODEL_ALIASES: dict[str, str] = {
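The catalog entries above map a model ID to a `(module name, class name)` pair. One common way to consume such a table is to import the module only when the model is first requested, so optional dependencies such as `olmoearth-pretrain-minimal` are not needed until then. The sketch below shows that pattern with `importlib`; the function `resolve_embedder`, its `package` parameter, and the demo catalog are hypothetical, and how `rs-embed` actually resolves its catalog is not shown in this diff.

```python
import importlib


def resolve_embedder(catalog, model_id, package=None):
    """Look up (module, class) for model_id and import the class lazily."""
    try:
        module_name, class_name = catalog[model_id]
    except KeyError:
        raise KeyError(f"unknown model id: {model_id!r}") from None
    # Optionally qualify the module with a parent package name.
    full_name = f"{package}.{module_name}" if package else module_name
    module = importlib.import_module(full_name)
    return getattr(module, class_name)


# Stand-in catalog that points at stdlib classes so the sketch runs anywhere.
DEMO_CATALOG = {
    "ordered": ("collections", "OrderedDict"),
    "fraction": ("fractions", "Fraction"),
}

ordered_cls = resolve_embedder(DEMO_CATALOG, "ordered")
fraction_cls = resolve_embedder(DEMO_CATALOG, "fraction")
```

Under this pattern, an import error for a missing optional package surfaces only when the corresponding model ID is requested.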