[provenance] Plan: opt-in empty-chunk-aware read path (prefetch_populated_keys)

> **Provenance note**: this is a copy of the planning document for my own tracking. The upstream issue and PR (against `zarr-developers/zarr-python`) are separate and use their own framing. Do not link this from upstream.

---

# Empty-chunk-aware fast read path for zarr-python

## Context

The zagg benchmark report ([REPORT.md](https://github.com/englacial/zagg/blob/main/bench/REPORT.md)) and notebook
([layout_access_numpy.ipynb](https://github.com/englacial/zagg/blob/main/bench/layout_access_numpy.ipynb))
identify an upstream optimization in zarr-python's read path. On a sparse 1D HEALPix array
(49,152 chunks, ~1,300 populated), a full read takes 173 s — of which ~150 s is spent in
Python iteration over empty chunks that resolve to `fill_value` with zero I/O. A 30-line
recipe (`fast_read_sparse`) that LISTs populated chunks first and reads only those is **64×
faster on sparse, 3× faster on dense** (the dense win comes from higher concurrency, which
is being deferred to a separate PR per user direction).

The proposal: bring this optimization upstream into zarr-python as an opt-in flag. xarray
and dask call `__getitem__` and `get_basic_selection`, so a flag wired into the existing
read path benefits the whole ecosystem when activated via the `zarr.config.set(...)` context
manager, without committing to a new public method or a default-on behavior change.

User decisions confirmed before writing this plan:
- **API surface: opt-in flag only.** No new `Array.fast_read_sparse()` method.
- **Concurrency: keep separate.** The default `async.concurrency=10` stays untouched in this PR.

## Design

### What the optimization does

Before launching reads for a selection, issue one `store.list_prefix(array_path)` call to
build a `frozenset[str]` of populated chunk/shard keys. For each chunk in the selection:

- **In set** → read normally via the codec pipeline.
- **Not in set** → fill the corresponding region of the output buffer with `fill_value` directly,
  skipping `byte_getter.get()` and codec decode. This is semantically identical to today's
  "missing chunk" path, just without the round-trip.

Filtering is done at the **shard-key level** (using `array._iter_shard_keys()` semantics)
so it works uniformly for sharded and non-sharded arrays. For sharded arrays this skips
empty *shards*; the existing per-shard partial-decode already handles empty chunks within
a populated shard efficiently.
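
The two branches above can be illustrated with a minimal, store-free sketch. It is not zarr-python code: the dict stands in for a store, `list_populated` stands in for one `store.list_prefix` call, and `read_with_prefetch` is a hypothetical name. Each chunk holds a single value to keep the indexing trivial.

```python
from __future__ import annotations


def list_populated(store: dict[str, bytes], prefix: str) -> frozenset[str]:
    """Stand-in for a single list_prefix call: every existing key under `prefix`."""
    return frozenset(key for key in store if key.startswith(prefix))


def read_with_prefetch(
    store: dict[str, bytes],
    prefix: str,
    chunk_keys: list[str],
    fill_value: bytes,
) -> tuple[list[bytes], int]:
    """Return one value per requested chunk plus the number of store GETs issued."""
    populated = list_populated(store, prefix)  # one listing up front
    out: list[bytes] = []
    gets = 0
    for key in chunk_keys:
        if key in populated:
            out.append(store[key])  # "in set": normal read
            gets += 1
        else:
            out.append(fill_value)  # "not in set": fill directly, no GET, no decode
    return out, gets


# Five requested chunks, two of which exist in the toy store.
store = {"arr/c/1": b"\x07", "arr/c/3": b"\x09"}
keys = [f"arr/c/{i}" for i in range(5)]
values, gets = read_with_prefetch(store, "arr/", keys, fill_value=b"\x00")
```

In this toy run only two GETs are issued for five requested chunks; the other three positions are filled without touching the store.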

When `read_missing_chunks=False` is set, missing chunks still raise `ChunkNotFoundError` —
the optimization just detects missingness via the populated set instead of a failed GET.
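
A sketch of that strict-mode check, under stated assumptions: the exception class below is a local stand-in named after the one in this plan, and `check_missing` is a hypothetical helper, not an existing zarr-python function.

```python
from __future__ import annotations


class ChunkNotFoundError(KeyError):
    """Local stand-in for the exception this plan refers to."""


def check_missing(
    expected_keys: list[str],
    populated: frozenset[str],
    read_missing_chunks: bool,
) -> list[str]:
    """Return the keys absent from `populated`; raise instead when strict mode is on."""
    missing = [key for key in expected_keys if key not in populated]
    if missing and not read_missing_chunks:
        # Same outcome as a failed GET today, detected from the listing instead.
        raise ChunkNotFoundError(
            f"{len(missing)} chunk(s) missing, first: {missing[0]}"
        )
    return missing
```

With `read_missing_chunks=True` the returned list is what the read path would fill with `fill_value`; with `False` the first missing key aborts the read before any data is fetched.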

### API surface

A single flag, exposed in three equivalent ways (all wired to one internal code path):

1. **Global config**: `zarr.config.set({"array.prefetch_populated_keys": True})`, for context-managed
   activation. xarray/dask inherit this for free.
2. **`ArrayConfig` field**: `prefetch_populated_keys: bool` alongside `read_missing_chunks` /
   `write_empty_chunks`. Set at array open via `with_config(...)` or constructor.
3. **Per-call kwarg**: `arr.get_basic_selection(..., prefetch_populated_keys=True)` and
   `arr.get_block_selection(..., prefetch_populated_keys=True)`. Overrides config for one call.

Default: **off**. The optimization is purely additive; off-path behavior is unchanged.
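
One way the three entry points could collapse into a single decision is sketched below. `resolve_prefetch` is a hypothetical helper, and treating the `ArrayConfig` value as optional (`None` meaning "not set on this array") is an assumption of the sketch; the plan itself only fixes that the per-call kwarg wins.

```python
from __future__ import annotations


def resolve_prefetch(
    per_call: bool | None,
    array_config: bool | None,
    global_default: bool = False,
) -> bool:
    """Pick the effective flag: per-call kwarg, then array config, then global config."""
    if per_call is not None:
        return per_call  # explicit kwarg overrides everything for this call
    if array_config is not None:
        return array_config  # value carried by the array's config
    return global_default  # falls back to the global setting (off by default)
```

For example, `resolve_prefetch(False, True, True)` is `False`: a caller can switch the prefetch off for one read even when it is enabled everywhere else.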

### Files to modify

- `src/zarr/core/config.py` — add `array.prefetch_populated_keys: False` default (line 96 region).
- `src/zarr/core/array_spec.py` — add `prefetch_populated_keys: bool` field to `ArrayConfig`
  + `ArrayConfigParams`. Mirror the existing `read_missing_chunks` plumbing (lines 22–63).
- `src/zarr/core/array.py`:
  - In `_get_selection` (line 5373): when the flag is on, build the populated-key set via
    `store_path.store.list_prefix(store_path.path)` (mirror `_shards_initialized` at line 3941),
    then split `indexed_chunks` into `present_chunks` and `missing_chunks`. Pass only
    `present_chunks` to `codec_pipeline.read()`. For `missing_chunks`, write `fill_value`
    directly into `out_buffer[out_selection]`.
  - Reuse the `read_missing_chunks=False` error-construction logic that already exists at
    lines 5466–5480 (populated-set lookup is the same signal as the current `status="missing"`
    return).
  - Add per-call `prefetch_populated_keys: bool | None = None` kwarg to `get_basic_selection`
    (line ~2429), `get_block_selection` (line 3556), and corresponding async variants. When
    not None, override the config field for that call.

### Reused primitives

- `_shards_initialized` (`src/zarr/core/array.py:3941`) — pattern for `list_prefix` +
  intersection with expected keys. The new helper returns a `frozenset` for O(1) membership.
- `_iter_shard_keys` (`src/zarr/core/array.py:5186`) — already abstracts sharded vs.
  non-sharded key iteration.
- `metadata.encode_chunk_key()` (`src/zarr/core/metadata/v3.py:584`) — used to convert
  chunk coordinates to store keys for membership lookup.
- `fill_value_or_default()` (used in `codec_pipeline.py:377, 407`) — the existing
  fill-value materialization used by the missing-chunk path.
- `_relativize_path()` (used in `array.py:3968`) — for prefix-relative key normalization.
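
How these pieces might combine into the membership set is sketched below in plain Python. The helpers are illustrative re-implementations, not the zarr-python functions listed above, and the key layout assumes the Zarr v3 default chunk key encoding (`c` prefix, `/` separator); v2 or custom encodings would differ.

```python
from __future__ import annotations


def relativize(key: str, prefix: str) -> str:
    """Strip the array prefix from a listed store key."""
    if prefix == "":
        return key
    full_prefix = prefix.rstrip("/") + "/"
    if not key.startswith(full_prefix):
        raise ValueError(f"{key!r} is not under {prefix!r}")
    return key[len(full_prefix):]


def encode_chunk_key(coords: tuple[int, ...], separator: str = "/") -> str:
    """Chunk coordinates -> relative key, following the v3 default encoding."""
    return separator.join(("c", *map(str, coords)))


def populated_set(listed_keys: list[str], prefix: str) -> frozenset[str]:
    """Relative chunk keys found in a listing; metadata documents are dropped."""
    relative = (relativize(key, prefix) for key in listed_keys)
    return frozenset(k for k in relative if k == "c" or k.startswith("c/"))


listing = ["arr/zarr.json", "arr/c/0/1", "arr/c/2/0"]
populated = populated_set(listing, "arr")
```

Membership is then a set lookup per chunk: `encode_chunk_key((0, 1)) in populated` holds here, while `encode_chunk_key((1, 1)) in populated` does not.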

### Sharding interaction

For the sparse-shards case (e.g. 192 shards, 5 populated): `list_prefix` returns 5 shard keys;
the indexer enumerates all 192 shards × inner chunks; we filter at the shard level and the
inner-chunk loop fills `fill_value` for the 187 empty shards without invoking `ShardingCodec`.

For populated shards with sparse inner chunks: no change — `ShardingCodec`'s existing
partial-decode path is already efficient (reads shard index once, O(1) per inner chunk).
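
The shard-level filter can be sketched as integer division from inner-chunk coordinates to shard coordinates. The function names, the choice of 4 inner chunks per shard, and the specific populated shard indices are all assumptions made for illustration; only the 192/5 split comes from the example above.

```python
from __future__ import annotations


def shard_of(chunk_coords: tuple[int, ...], chunks_per_shard: tuple[int, ...]) -> tuple[int, ...]:
    """Map inner-chunk coordinates to the coordinates of the containing shard."""
    return tuple(c // n for c, n in zip(chunk_coords, chunks_per_shard))


def split_by_shard(
    inner_chunks: list[tuple[int, ...]],
    chunks_per_shard: tuple[int, ...],
    populated_shards: frozenset[tuple[int, ...]],
) -> tuple[list[tuple[int, ...]], list[tuple[int, ...]]]:
    """Partition inner chunks by whether their shard appears in the listing."""
    present, missing = [], []
    for coords in inner_chunks:
        if shard_of(coords, chunks_per_shard) in populated_shards:
            present.append(coords)  # handed to the sharding codec as today
        else:
            missing.append(coords)  # filled with fill_value, codec never invoked
    return present, missing


# 1D example: 192 shards, 4 inner chunks each (assumed), 5 shards populated.
chunks_per_shard = (4,)
inner_chunks = [(i,) for i in range(192 * 4)]
populated_shards = frozenset({(0,), (10,), (50,), (100,), (191,)})
present, missing = split_by_shard(inner_chunks, chunks_per_shard, populated_shards)
empty_shards = {shard_of(c, chunks_per_shard) for c in missing}
```

Under these assumptions 20 inner chunks go through the normal read path and the remaining 748, spread over 187 shards, are filled directly.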

### Cost-model caveat

`list_prefix` cost varies by store (LocalStore: ms; FsspecStore on S3: 50–500 ms paginated;
HTTP-only stores: may not implement listing well). On a fully-dense store on a slow LIST
backend, the prefetch is pure overhead — hence the opt-in default. Document this in the
config docstring with a rule of thumb: *"enable for sparse arrays where most chunks are
empty, or any time you're reading the whole array on a store with cheap LIST."*
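
A back-of-the-envelope break-even estimate can be derived from the numbers in the Context section. This is a rough sketch only: it assumes the ~150 s of overhead is spread evenly over the empty chunks, that the per-chunk overhead measured in that benchmark carries over to other stores, and that the prefetch path itself costs nothing per empty chunk.

```python
import math

# Figures quoted from the benchmark report (approximate).
total_chunks = 49_152
populated_chunks = 1_300
empty_overhead_s = 150.0

# Average cost of iterating over one empty chunk on the current read path.
per_empty_s = empty_overhead_s / (total_chunks - populated_chunks)


def break_even_empty_chunks(list_cost_s: float) -> int:
    """Smallest number of empty chunks at which one LIST of this cost pays for itself."""
    return math.ceil(list_cost_s / per_empty_s)
```

That works out to roughly 3 ms per empty chunk, so under these assumptions a 500 ms LIST breaks even at about 160 empty chunks in the selection, and a 50 ms LIST at about 16.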

## Verification

### Tests to add (in `tests/test_indexing.py` and `tests/test_array.py`)

1. **Correctness vs. baseline**: same array read with flag on/off → byte-identical output.
   Parametrize over: sparse 1D, dense 1D, sparse 2D, sharded sparse, sharded dense, all-empty,
   all-populated, integer-fill-value, NaN-fill-value, structured dtype.
2. **Selection types**: `arr[:]`, `arr[10:1000]`, `arr.get_block_selection((3,))`,
   `arr.get_orthogonal_selection(...)` — all respect the flag.
3. **Activation paths**: global config context manager, `ArrayConfig` field, per-call kwarg —
   all produce identical results.
4. **`read_missing_chunks=False` interaction**: missing chunks raise the same
   `ChunkNotFoundError` regardless of flag.
5. **Race tolerance**: write a chunk between `list_prefix` and the read call; the read still
   resolves correctly (either reads new data or fills with `fill_value` — both acceptable;
   today's `arr[:]` has the same race).
6. **Store coverage**: run the full matrix on `MemoryStore`, `LocalStore`, `FsspecStore`
   (memory backend), `ZipStore`. Skip stores that don't support `list_prefix`.

### Benchmarks (in `bench/`)

Add a short `bench/empty_chunks.py` reproducing the report's setup at smaller scale (e.g.
1,024-chunk grid, 32 populated) on `MemoryStore` and `LocalStore`. Report wall time for
flag-off vs. flag-on. Target: ≥10× speedup on sparse, ≤5% regression on dense.
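
The timing and pass/fail logic of such a script could look like the sketch below. It covers only the measurement harness, not the zarr reads themselves; `best_of` and `meets_targets` are hypothetical helper names, and taking the best of several repeats is one possible choice for reducing noise.

```python
import time
from typing import Callable


def best_of(fn: Callable[[], object], repeats: int = 5) -> float:
    """Best wall-clock time in seconds over `repeats` calls of `fn`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def meets_targets(
    sparse_off_s: float,
    sparse_on_s: float,
    dense_off_s: float,
    dense_on_s: float,
) -> bool:
    """Check the stated targets: >=10x speedup on sparse, <=5% regression on dense."""
    speedup = sparse_off_s / sparse_on_s
    regression = (dense_on_s - dense_off_s) / dense_off_s
    return speedup >= 10.0 and regression <= 0.05
```

In the real script, each of the four timings would come from `best_of` wrapped around a full-array read with the flag off or on.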

### Manual verification

```bash
git checkout -b feat/prefetch-populated-keys
# implement
hatch env run -e test pytest tests/test_indexing.py tests/test_array.py -k "prefetch or sparse" -x
hatch env run -e test pytest tests/  # full suite — no regressions
python bench/empty_chunks.py
```

### News fragment

Add `changes/<PR>.feature.md` (the repo uses towncrier-style fragments; see
`changes/3679.feature.md`). One-line summary referencing the config key.

## Branch and PR

- Branch: `feat/prefetch-populated-keys` off `main`.
- PR title: `feat: opt-in empty-chunk-aware read path via prefetch_populated_keys`.
- **Open an issue first** — the contributing guide
  (`docs/contributing.md`, \"Enhancement proposals\" + \"AI-assisted contributions\" sections)
  explicitly asks for issue-first discussion of new features and large changes. Reference the
  zagg report and benchmark numbers in the issue.
- Keep the diff small (the change described above should come to roughly 150 lines of code
  plus tests). The contributing guide notes that large AI-assisted PRs may be closed for
  reviewability.

## Out of scope (explicitly deferred)

- Higher default `async.concurrency` (3× dense-read win in the report). Separate PR.
- New `Array.fast_read_sparse()` public method. Trivially built on top of the flag if
  needed later; not committed in this PR.
- Default-on behavior. Opt-in for now; revisit after real-world feedback and store-cost
  data.
- `dask.array.from_zarr()` graph pruning (report §6.4) — downstream in dask, not zarr-python.
- xdggs MOC-from-populated-chunks (report §6.2) — downstream in xdggs.
