[provenance] Plan: opt-in empty-chunk-aware read path (prefetch_populated_keys) #1

@espg

Provenance note: this is a copy of the planning document for my own tracking. The upstream issue and PR (against zarr-developers/zarr-python) are separate and use their own framing. Do not link this from upstream.


Empty-chunk-aware fast read path for zarr-python

Context

The zagg benchmark report (REPORT.md) and notebook (layout_access_numpy.ipynb) identify an
upstream optimization in zarr-python's read path. On a sparse 1D HEALPix array
(49,152 chunks, ~1,300 populated), a full read takes 173 s, of which ~150 s is spent in
Python iteration over empty chunks that resolve to fill_value with zero I/O. A 30-line
recipe (fast_read_sparse) that LISTs populated chunks first and reads only those is 64×
faster on sparse and 3× faster on dense (the dense win comes from higher concurrency, which
is being deferred to a separate PR per user direction).

The proposal: bring this optimization upstream into zarr-python as an opt-in flag. xarray
and dask call __getitem__ and get_basic_selection, so a flag wired into the existing
read path benefits the whole ecosystem when activated via the zarr.config.set(...) context
manager, without committing to a new public method or a default-on behavior change.

User decisions confirmed before writing this plan:

  • API surface: opt-in flag only. No new Array.fast_read_sparse() method.
  • Concurrency: keep separate. The default async.concurrency=10 stays untouched in this PR.

Design

What the optimization does

Before launching reads for a selection, issue one store.list_prefix(array_path) call to
build a frozenset[str] of populated chunk/shard keys. For each chunk in the selection:

  • In set → read normally via the codec pipeline.
  • Not in set → fill the corresponding region of the output buffer with fill_value directly,
    skipping byte_getter.get() and codec decode. This is semantically identical to today's
    "missing chunk" path, just without the round-trip.

Filtering is done at the shard-key level (using array._iter_shard_keys() semantics)
so it works uniformly for sharded and non-sharded arrays. For sharded arrays this skips
empty shards; the existing per-shard partial-decode already handles empty chunks within
a populated shard efficiently.

When read_missing_chunks=False is set, missing chunks still raise ChunkNotFoundError;
the optimization just detects missingness via the populated set instead of a failed GET.
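
The per-chunk decision above can be sketched with a toy in-memory store. This is a minimal illustration under stated assumptions, not zarr-python code: ToyStore, read_with_prefetch, and the c/<i> key layout are hypothetical stand-ins, and KeyError stands in for ChunkNotFoundError.

```python
from typing import Dict, List


class ToyStore:
    """Hypothetical stand-in for a store: maps chunk keys to decoded values."""

    def __init__(self, chunks: Dict[str, List[int]]) -> None:
        self._chunks = chunks
        self.get_calls = 0  # counts simulated round-trips

    def list_keys(self) -> List[str]:
        return list(self._chunks)

    def get(self, key: str) -> List[int]:
        self.get_calls += 1
        return self._chunks[key]


def read_with_prefetch(
    store: ToyStore,
    n_chunks: int,
    chunk_len: int,
    fill_value: int,
    read_missing_chunks: bool = True,
) -> List[int]:
    # One listing call up front; a frozenset gives O(1) membership tests.
    populated = frozenset(store.list_keys())
    out: List[int] = []
    for i in range(n_chunks):
        key = f"c/{i}"
        if key in populated:
            out.extend(store.get(key))  # normal read path
        elif read_missing_chunks:
            out.extend([fill_value] * chunk_len)  # filled without a round-trip
        else:
            # Stand-in for ChunkNotFoundError in this sketch.
            raise KeyError(key)
    return out


store = ToyStore({"c/1": [7, 8], "c/3": [9, 10]})
result = read_with_prefetch(store, n_chunks=5, chunk_len=2, fill_value=0)
```

In this toy run only the two populated keys trigger a get; the three empty chunks are filled directly from fill_value.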

API surface

A single flag, exposed in three equivalent ways (all wired to one internal code path):

  1. Global config: zarr.config.set({"array.prefetch_populated_keys": True}) for
    context-managed activation. xarray/dask inherit this for free.
  2. ArrayConfig field: prefetch_populated_keys: bool alongside read_missing_chunks /
    write_empty_chunks. Set at array open via with_config(...) or constructor.
  3. Per-call kwarg: arr.get_basic_selection(..., prefetch_populated_keys=True) and
    arr.get_block_selection(..., prefetch_populated_keys=True). Overrides config for one call.

Default: off. The optimization is purely additive; off-path behavior is unchanged.
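
One way the three activation paths could collapse into a single effective value is sketched below. The plan only states that the per-call kwarg overrides config; the ordering between the ArrayConfig field and the global config, the helper name resolve_prefetch_flag, and the GLOBAL_CONFIG dict are assumptions made for illustration.

```python
from typing import Optional

# Assumed global default, mirroring the proposed config key.
GLOBAL_CONFIG = {"array.prefetch_populated_keys": False}


def resolve_prefetch_flag(
    per_call: Optional[bool],
    array_config_value: Optional[bool],
) -> bool:
    """Return the effective flag: per-call kwarg, then ArrayConfig, then global."""
    if per_call is not None:
        return per_call  # explicit kwarg wins for this one call
    if array_config_value is not None:
        return array_config_value  # value fixed when the array was opened
    return GLOBAL_CONFIG["array.prefetch_populated_keys"]
```

Using None as the "not specified" sentinel keeps an explicit per-call False distinguishable from an omitted kwarg, so a caller can switch the optimization off for one read even when the config enables it.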

Files to modify

  • src/zarr/core/config.py — add array.prefetch_populated_keys: False default (line 96 region).
  • src/zarr/core/array_spec.py — add prefetch_populated_keys: bool field to ArrayConfig
    and ArrayConfigParams. Mirror the existing read_missing_chunks plumbing (lines 22–63).
  • src/zarr/core/array.py:
    • In _get_selection (line 5373): when the flag is on, build the populated-key set via
      store_path.store.list_prefix(store_path.path) (mirror _shards_initialized at line 3941),
      then split indexed_chunks into present_chunks and missing_chunks. Pass only
      present_chunks to codec_pipeline.read(). For missing_chunks, write fill_value
      directly into out_buffer[out_selection].
    • Reuse the read_missing_chunks=False error-construction logic that already exists at
      lines 5466–5480: the populated-set lookup is the same signal as the current
      status="missing" return.
    • Add per-call prefetch_populated_keys: bool | None = None kwarg to get_basic_selection
      (line ~2429), get_block_selection (line 3556), and corresponding async variants. When
      not None, override the config field for that call.

Reused primitives

  • _shards_initialized (src/zarr/core/array.py:3941) — pattern for list_prefix +
    intersection with expected keys. The new helper returns a frozenset for O(1) membership.
  • _iter_shard_keys (src/zarr/core/array.py:5186) — already abstracts sharded vs.
    non-sharded key iteration.
  • metadata.encode_chunk_key() (src/zarr/core/metadata/v3.py:584) — used to convert
    chunk coordinates to store keys for membership lookup.
  • fill_value_or_default() (used in codec_pipeline.py:377, 407) — the existing
    fill-value materialization used by the missing-chunk path.
  • _relativize_path() (used in array.py:3968) — for prefix-relative key normalization.
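
The list-then-relativize pattern these primitives support might look roughly like the sketch below. It does not call zarr-python: ToyAsyncStore, relativize, and populated_keys are hypothetical names, and the only behavior borrowed from the real store interface is that list_prefix is an async generator of string keys.

```python
import asyncio
from typing import AsyncIterator, FrozenSet, List


class ToyAsyncStore:
    """Hypothetical store whose list_prefix is an async generator of keys."""

    def __init__(self, keys: List[str]) -> None:
        self._keys = keys

    async def list_prefix(self, prefix: str) -> AsyncIterator[str]:
        for key in self._keys:
            if key.startswith(prefix):
                yield key


def relativize(key: str, prefix: str) -> str:
    # Illustrative only: drop the array prefix so keys look like "c/0".
    return key[len(prefix):] if prefix else key


async def populated_keys(store: ToyAsyncStore, array_path: str) -> FrozenSet[str]:
    prefix = (array_path.rstrip("/") + "/") if array_path else ""
    found = set()
    async for key in store.list_prefix(prefix):
        rel = relativize(key, prefix)
        if rel != "zarr.json":  # the metadata document is not a chunk
            found.add(rel)
    return frozenset(found)


store = ToyAsyncStore(["arr/zarr.json", "arr/c/0", "arr/c/7", "other/c/0"])
keys = asyncio.run(populated_keys(store, "arr"))
```

The resulting frozenset holds prefix-relative keys, which is the form an encoded chunk key would need to match for the membership test.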

Sharding interaction

For the sparse-shards case (e.g. 192 shards, 5 populated): list_prefix returns 5 shard keys;
the indexer enumerates all 192 shards × inner chunks; we filter at the shard level and the
inner-chunk loop fills fill_value for the 187 empty shards without invoking ShardingCodec.

For populated shards with sparse inner chunks: no change — ShardingCodec's existing
partial-decode path is already efficient (reads shard index once, O(1) per inner chunk).
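
The shard-level filter for the sparse-shards case can be illustrated in one dimension as follows. The function names, the c/<shard> key layout, the choice of 4 inner chunks per shard, and the particular populated shard indices are all assumptions for the example.

```python
from typing import FrozenSet, List, Tuple


def shard_key_for_chunk(chunk_index: int, chunks_per_shard: int) -> str:
    # 1D case: integer division maps an inner-chunk index to its shard index.
    return f"c/{chunk_index // chunks_per_shard}"


def split_by_shard(
    n_inner_chunks: int,
    chunks_per_shard: int,
    populated_shards: FrozenSet[str],
) -> Tuple[List[int], List[int]]:
    to_read: List[int] = []
    to_fill: List[int] = []
    for i in range(n_inner_chunks):
        if shard_key_for_chunk(i, chunks_per_shard) in populated_shards:
            to_read.append(i)  # handed to the sharding codec as usual
        else:
            to_fill.append(i)  # whole shard absent: fill_value, no codec call
    return to_read, to_fill


# 192 shards with 5 populated; 4 inner chunks per shard is an arbitrary choice.
populated = frozenset(f"c/{s}" for s in (0, 17, 42, 100, 191))
to_read, to_fill = split_by_shard(192 * 4, 4, populated)
```

Inner chunks that land in to_read may still be empty inside their shard; those remain the responsibility of the existing partial-decode path.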

Cost-model caveat

list_prefix cost varies by store (LocalStore: ms; FsspecStore on S3: 50–500 ms paginated;
HTTP-only stores: may not implement listing well). On a fully-dense store on a slow LIST
backend, the prefetch is pure overhead — hence the opt-in default. Document this in the
config docstring with a rule of thumb: "enable for sparse arrays where most chunks are
empty, or any time you're reading the whole array on a store with cheap LIST."
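
A rough break-even check can make the rule of thumb concrete. The sketch below assumes a deliberately simple model in which the prefetch pays off when the time saved on empty chunks exceeds the cost of one listing; the per-empty-chunk overhead is derived from the report's approximate figures (~150 s over 49,152 chunks with ~1,300 populated), and the function names are hypothetical.

```python
def per_empty_chunk_overhead(
    total_overhead_s: float, n_chunks: int, n_populated: int
) -> float:
    """Average time spent per empty chunk on the current read path."""
    return total_overhead_s / (n_chunks - n_populated)


def prefetch_pays_off(
    list_cost_s: float, n_empty: int, overhead_per_empty_s: float
) -> bool:
    """True when skipping empty chunks saves more time than the listing costs."""
    return n_empty * overhead_per_empty_s > list_cost_s


# Figures from the report; both counts are approximate.
overhead = per_empty_chunk_overhead(150.0, 49_152, 1_300)  # roughly 3 ms

sparse_wins = prefetch_pays_off(0.5, 49_152 - 1_300, overhead)
dense_wins = prefetch_pays_off(0.5, 0, overhead)
```

With a 0.5 s listing, the sparse case clearly benefits while a fully dense array gains nothing, which matches the opt-in default. The model ignores any store-dependent variation in per-chunk overhead, so it is only a first-order guide.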

Verification

Tests to add (in tests/test_indexing.py and tests/test_array.py)

  1. Correctness vs. baseline: same array read with flag on/off → byte-identical output.
    Parametrize over: sparse 1D, dense 1D, sparse 2D, sharded sparse, sharded dense, all-empty,
    all-populated, integer-fill-value, NaN-fill-value, structured dtype.
  2. Selection types: arr[:], arr[10:1000], arr.get_block_selection((3,)),
    arr.get_orthogonal_selection(...) — all respect the flag.
  3. Activation paths: global config context manager, ArrayConfig field, per-call kwarg —
    all produce identical results.
  4. read_missing_chunks=False interaction: missing chunks raise the same
    ChunkNotFoundError regardless of flag.
  5. Race tolerance: write a chunk between list_prefix and the read call; the read still
    resolves correctly (either reads new data or fills with fill_value — both acceptable;
    today's arr[:] has the same race).
  6. Store coverage: run the full matrix on MemoryStore, LocalStore, FsspecStore
    (memory backend), ZipStore. Skip stores that don't support list_prefix.

Benchmarks (in bench/)

Add a short bench/empty_chunks.py reproducing the report's setup at smaller scale (e.g.
1,024-chunk grid, 32 populated) on MemoryStore and LocalStore. Report wall time for
flag-off vs. flag-on. Target: ≥10× speedup on sparse, ≤5% regression on dense.

Manual verification

git checkout -b feat/prefetch-populated-keys
# implement
hatch env run -e test pytest tests/test_indexing.py tests/test_array.py -k "prefetch or sparse" -x
hatch env run -e test pytest tests/  # full suite — no regressions
python bench/empty_chunks.py

News fragment

Add changes/<PR>.feature.md (the repo uses towncrier-style fragments; see
changes/3679.feature.md). One-line summary referencing the config key.

Branch and PR

  • Branch: feat/prefetch-populated-keys off main.
  • PR title: feat: opt-in empty-chunk-aware read path via prefetch_populated_keys.
  • Open an issue first — the contributing guide
    (docs/contributing.md, "Enhancement proposals" + "AI-assisted contributions" sections)
    explicitly asks for issue-first discussion of new features and large changes. Reference the
    zagg report and benchmark numbers in the issue.
  • Keep the diff small (the change described above is an estimated ~150 lines of code plus
    tests). The contributing guide notes that large AI-assisted PRs may be closed if they are
    hard to review.

Out of scope (explicitly deferred)

  • Higher default async.concurrency (3× dense-read win in the report). Separate PR.
  • New Array.fast_read_sparse() public method. Trivially built on top of the flag if
    needed later; not committed in this PR.
  • Default-on behavior. Opt-in for now; revisit after real-world feedback and store-cost
    data.
  • dask.array.from_zarr() graph pruning (report §6.4) — downstream in dask, not zarr-python.
  • xdggs MOC-from-populated-chunks (report §6.2) — downstream in xdggs.
