Provenance note: this is a copy of the planning document for my own tracking. The upstream issue and PR (against zarr-developers/zarr-python) are separate and use their own framing. Do not link this from upstream.
Empty-chunk-aware fast read path for zarr-python
Context
The zagg benchmark report (REPORT.md) and notebook
(layout_access_numpy.ipynb)
identify an upstream optimization in zarr-python's read path. On a sparse 1D HEALPix array
(49,152 chunks, ~1,300 populated), a full read takes 173 s — of which ~150 s is spent in
Python iteration over empty chunks that resolve to fill_value with zero I/O. A 30-line
recipe (fast_read_sparse) that LISTs populated chunks first and reads only those is 64×
faster on sparse, 3× faster on dense (the dense win comes from higher concurrency, which
is being deferred to a separate PR per user direction).
The proposal: bring this optimization upstream into zarr-python as an opt-in flag. xarray
and dask call __getitem__ and get_basic_selection, so a flag wired into the existing
read path benefits the whole ecosystem when activated via the zarr.config.set(...) context
manager, without committing to a new public method or a default-on behavior change.
User decisions confirmed before writing this plan:
- API surface: opt-in flag only. No new Array.fast_read_sparse() method.
- Concurrency: keep separate. The default async.concurrency=10 stays untouched in this PR.
Design
What the optimization does
Before launching reads for a selection, issue one store.list_prefix(array_path) call to
build a frozenset[str] of populated chunk/shard keys. For each chunk in the selection:
- In set → read normally via the codec pipeline.
- Not in set → fill the corresponding region of the output buffer with fill_value directly,
  skipping byte_getter.get() and codec decode. This is semantically identical to today's
  "missing chunk" path, just without the round-trip.
Filtering is done at the shard-key level (using array._iter_shard_keys() semantics)
so it works uniformly for sharded and non-sharded arrays. For sharded arrays this skips
empty shards; the existing per-shard partial-decode already handles empty chunks within
a populated shard efficiently.
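The read flow described above can be illustrated with a toy model. This is a minimal sketch, not zarr-python's implementation: the store is assumed to be a plain dict, the array is 1D and unsharded, and read_with_populated_set is a hypothetical name.

```python
import numpy as np

def read_with_populated_set(store, n_chunks, chunk_len, fill_value, dtype="float64"):
    """Toy 1D full read; `store` is a plain dict mapping chunk key -> ndarray."""
    # One listing up front, frozen into a set for O(1) membership tests.
    populated = frozenset(store.keys())
    out = np.empty(n_chunks * chunk_len, dtype=dtype)
    reads = 0
    for i in range(n_chunks):
        region = slice(i * chunk_len, (i + 1) * chunk_len)
        key = f"c/{i}"
        if key in populated:
            out[region] = store[key]   # stands in for the codec-pipeline read
            reads += 1
        else:
            out[region] = fill_value   # no store round-trip, no decode
    return out, reads

store = {"c/1": np.arange(4.0), "c/5": np.ones(4)}
out, reads = read_with_populated_set(store, n_chunks=8, chunk_len=4, fill_value=-1.0)
```

Only two of the eight chunks touch the store here; the other six are filled directly from fill_value.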
When read_missing_chunks=False is set, missing chunks still raise ChunkNotFoundError —
the optimization just detects missingness via the populated set instead of a failed GET.
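A sketch of how strict mode could fall out of the same membership test. The exception class below is a local stand-in for the ChunkNotFoundError named above (it is not imported from zarr), and partition_chunks is a hypothetical helper.

```python
class ChunkNotFoundError(KeyError):
    """Local stand-in for the exception named in the plan; not zarr's class."""

def partition_chunks(selected_keys, populated, read_missing_chunks=True):
    """Split selected chunk keys into (present, missing) using the populated set."""
    present = [k for k in selected_keys if k in populated]
    missing = [k for k in selected_keys if k not in populated]
    if missing and not read_missing_chunks:
        # Same outcome as a failed GET in strict mode, detected without any GET.
        raise ChunkNotFoundError(
            f"{len(missing)} chunk(s) missing, first: {missing[0]}"
        )
    return present, missing

populated = frozenset({"c/0", "c/2"})
present, missing = partition_chunks(["c/0", "c/1", "c/2"], populated)
```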
API surface
A single flag, exposed in three equivalent ways (all wired to one internal code path):
- Global config: zarr.config.set({"array.prefetch_populated_keys": True}), a context-managed
  activation. xarray/dask inherit this for free.
- ArrayConfig field: prefetch_populated_keys: bool alongside read_missing_chunks /
  write_empty_chunks. Set at array open via with_config(...) or constructor.
- Per-call kwarg: arr.get_basic_selection(..., prefetch_populated_keys=True) and
  arr.get_block_selection(..., prefetch_populated_keys=True). Overrides config for one call.
Default: off. The optimization is purely additive; off-path behavior is unchanged.
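The override order can be sketched with a toy config. Everything here is illustrative: _CONFIG and config_set stand in for the real config object, resolve_prefetch_flag is a hypothetical helper, and the ArrayConfig layer is collapsed into the global value for brevity.

```python
from contextlib import contextmanager
from typing import Optional

_CONFIG = {"array.prefetch_populated_keys": False}  # default: off

@contextmanager
def config_set(overrides):
    """Toy stand-in for a context-managed config override."""
    saved = dict(_CONFIG)
    _CONFIG.update(overrides)
    try:
        yield
    finally:
        # Restore the previous values on exit, even if the body raised.
        _CONFIG.clear()
        _CONFIG.update(saved)

def resolve_prefetch_flag(per_call: Optional[bool] = None) -> bool:
    """A per-call value wins when given; None falls back to the config."""
    if per_call is not None:
        return per_call
    return _CONFIG["array.prefetch_populated_keys"]
```

Using None as the kwarg default is what lets a caller force the flag off for one call even while the config has it on.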
Files to modify
- src/zarr/core/config.py: add array.prefetch_populated_keys: False default (line 96 region).
- src/zarr/core/array_spec.py: add prefetch_populated_keys: bool field to ArrayConfig and
  ArrayConfigParams. Mirror the existing read_missing_chunks plumbing (lines 22–63).
- src/zarr/core/array.py:
  - In _get_selection (line 5373): when the flag is on, build the populated-key set via
    store_path.store.list_prefix(store_path.path) (mirror _shards_initialized at line 3941),
    then split indexed_chunks into present_chunks and missing_chunks. Pass only
    present_chunks to codec_pipeline.read(). For missing_chunks, write fill_value
    directly into out_buffer[out_selection].
  - Reuse the read_missing_chunks=False error-construction logic that already exists at
    lines 5466–5480: populated-set lookup is the same signal as the current status="missing"
    return.
  - Add per-call prefetch_populated_keys: bool | None = None kwarg to get_basic_selection
    (line ~2429), get_block_selection (line 3556), and corresponding async variants. When
    not None, override the config field for that call.
Reused primitives
- _shards_initialized (src/zarr/core/array.py:3941): pattern for list_prefix +
  intersection with expected keys. The new helper returns a frozenset for O(1) membership.
- _iter_shard_keys (src/zarr/core/array.py:5186): already abstracts sharded vs.
  non-sharded key iteration.
- metadata.encode_chunk_key() (src/zarr/core/metadata/v3.py:584): used to convert
  chunk coordinates to store keys for membership lookup.
- fill_value_or_default() (used in codec_pipeline.py:377, 407): the existing
  fill-value materialization used by the missing-chunk path.
- _relativize_path() (used in array.py:3968): for prefix-relative key normalization.
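How these pieces might combine into the new helper, as a sketch under stated assumptions: ToyStore mimics only an async list_prefix that yields store-root-relative keys, populated_keys is a hypothetical name, and encode_chunk_key here is a local re-implementation of the v3 default key encoding ("c" plus "/"-joined coordinates), not the metadata method itself.

```python
import asyncio

class ToyStore:
    """Illustrative stand-in exposing only an async list_prefix."""
    def __init__(self, keys):
        self._keys = list(keys)

    async def list_prefix(self, prefix):
        for key in self._keys:
            if key.startswith(prefix):
                yield key

def encode_chunk_key(coords, separator="/"):
    """v3 default chunk key encoding: 'c' followed by the chunk coordinates."""
    return separator.join(["c", *map(str, coords)])

async def populated_keys(store, array_path):
    """List once, make keys prefix-relative, and keep only chunk/shard keys."""
    prefix = array_path.rstrip("/") + "/" if array_path else ""
    keys = set()
    async for key in store.list_prefix(prefix):
        rel = key[len(prefix):]
        if rel.startswith("c/"):  # skips zarr.json and other non-chunk keys
            keys.add(rel)
    return frozenset(keys)

store = ToyStore(["arr/zarr.json", "arr/c/0/1", "arr/c/2/3", "other/c/0/0"])
populated = asyncio.run(populated_keys(store, "arr"))
```

Membership is then a plain set lookup, for example encode_chunk_key((0, 1)) in populated.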
Sharding interaction
For the sparse-shards case (e.g. 192 shards, 5 populated): list_prefix returns 5 shard keys;
the indexer enumerates all 192 shards × inner chunks; we filter at the shard level and the
inner-chunk loop fills fill_value for the 187 empty shards without invoking ShardingCodec.
For populated shards with sparse inner chunks: no change — ShardingCodec's existing
partial-decode path is already efficient (reads shard index once, O(1) per inner chunk).
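The shard-level filter for the 192-shard example can be sketched as follows. The 16 inner chunks per shard, the particular populated shard indices, and shard_key_for_chunk are all assumptions made for illustration.

```python
def shard_key_for_chunk(chunk_coords, chunks_per_shard):
    """Map inner-chunk coordinates to the key of the shard that contains them."""
    shard_coords = tuple(c // n for c, n in zip(chunk_coords, chunks_per_shard))
    return "c/" + "/".join(map(str, shard_coords))

n_shards = 192
chunks_per_shard = (16,)  # assumed shard size, not taken from the report
populated_shards = frozenset(f"c/{i}" for i in (3, 40, 41, 100, 191))

decoded, filled = set(), set()
for chunk in range(n_shards * chunks_per_shard[0]):
    key = shard_key_for_chunk((chunk,), chunks_per_shard)
    # Only populated shards would reach the sharding codec; the rest get fill_value.
    (decoded if key in populated_shards else filled).add(key)
```

The indexer still visits every inner chunk, but only 5 distinct shard keys would be handed to the codec; the other 187 are filled.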
Cost-model caveat
list_prefix cost varies by store (LocalStore: ms; FsspecStore on S3: 50–500 ms paginated;
HTTP-only stores: may not implement listing well). On a fully-dense store on a slow LIST
backend, the prefetch is pure overhead — hence the opt-in default. Document this in the
config docstring with a rule of thumb: "enable for sparse arrays where most chunks are
empty, or any time you're reading the whole array on a store with cheap LIST."
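A rough break-even estimate for the rule of thumb, using only the figures quoted in the Context section (~150 s over the empty chunks of a 49,152-chunk array with ~1,300 populated). This is a serial back-of-envelope model that ignores concurrency and store differences; both function names are hypothetical.

```python
def per_empty_chunk_cost(empty_seconds, n_chunks, n_populated):
    """Average wall time spent per empty chunk on the current read path."""
    return empty_seconds / (n_chunks - n_populated)

def prefetch_pays_off(list_seconds, n_empty, cost_per_empty):
    """The prefetch wins when one LIST costs less than the empty-chunk work it avoids."""
    return list_seconds < n_empty * cost_per_empty

cost = per_empty_chunk_cost(150.0, 49_152, 1_300)  # roughly 3.1 ms per empty chunk
# Number of empty chunks at which a 0.5 s LIST breaks even under this model.
break_even_empties = 0.5 / cost
```

Under these numbers a 500 ms LIST is recovered once a read would otherwise touch on the order of 160 empty chunks; on a fully dense array it is never recovered.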
Verification
Tests to add (in tests/test_indexing.py and tests/test_array.py)
- Correctness vs. baseline: same array read with flag on/off → byte-identical output.
Parametrize over: sparse 1D, dense 1D, sparse 2D, sharded sparse, sharded dense, all-empty,
all-populated, integer-fill-value, NaN-fill-value, structured dtype.
- Selection types: arr[:], arr[10:1000], arr.get_block_selection((3,)),
  arr.get_orthogonal_selection(...) all respect the flag.
- Activation paths: global config context manager, ArrayConfig field, per-call kwarg
  all produce identical results.
- read_missing_chunks=False interaction: missing chunks raise the same
  ChunkNotFoundError regardless of flag.
- Race tolerance: write a chunk between list_prefix and the read call; the read still
  resolves correctly (either reads new data or fills with fill_value; both are acceptable,
  and today's arr[:] has the same race).
- Store coverage: run the full matrix on MemoryStore, LocalStore, FsspecStore
  (memory backend), ZipStore. Skip stores that don't support list_prefix.
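One detail worth sketching for the correctness tests: with a NaN fill_value, elementwise == cannot show that two reads agree, so a byte-level comparison is one way to implement "byte-identical". assert_byte_identical is a hypothetical test helper, not an existing zarr or NumPy function.

```python
import numpy as np

def assert_byte_identical(a, b):
    """Compare dtype, shape, and raw bytes; works for NaN fills and structured dtypes."""
    assert a.dtype == b.dtype, (a.dtype, b.dtype)
    assert a.shape == b.shape, (a.shape, b.shape)
    assert np.ascontiguousarray(a).tobytes() == np.ascontiguousarray(b).tobytes()

# NaN-filled reads: elementwise equality is False everywhere, bytes still match.
nan_filled = np.full(8, np.nan)
assert_byte_identical(nan_filled, nan_filled.copy())

# Structured dtype: the same helper applies unchanged.
records = np.zeros(3, dtype=[("a", "i4"), ("b", "f8")])
assert_byte_identical(records, records.copy())
```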
Benchmarks (in bench/)
Add a short bench/empty_chunks.py reproducing the report's setup at smaller scale (e.g.
1,024-chunk grid, 32 populated) on MemoryStore and LocalStore. Report wall time for
flag-off vs. flag-on. Target: ≥10× speedup on sparse, ≤5% regression on dense.
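The pass/fail criteria can be written down as a small helper that the benchmark script might call after timing each configuration. A minimal sketch; best_of and meets_targets are hypothetical names, and best-of-N timing is just one reasonable choice.

```python
import time

def best_of(fn, repeats=5):
    """Best wall time over several runs, to damp one-off noise."""
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

def meets_targets(sparse_off, sparse_on, dense_off, dense_on):
    """Check the stated targets: >=10x speedup on sparse, <=5% regression on dense."""
    speedup = sparse_off / sparse_on
    regression = dense_on / dense_off - 1.0
    return speedup >= 10.0 and regression <= 0.05
```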
Manual verification
git checkout -b feat/prefetch-populated-keys
# implement
hatch env run -e test pytest tests/test_indexing.py tests/test_array.py -k "prefetch or sparse" -x
hatch env run -e test pytest tests/ # full suite — no regressions
python bench/empty_chunks.py
News fragment
Add changes/<PR>.feature.md (the repo uses towncrier-style fragments; see
changes/3679.feature.md). One-line summary referencing the config key.
Branch and PR
- Branch: feat/prefetch-populated-keys off main.
- PR title: feat: opt-in empty-chunk-aware read path via prefetch_populated_keys.
- Open an issue first: the contributing guide (docs/contributing.md, "Enhancement proposals"
  + "AI-assisted contributions" sections) explicitly asks for issue-first discussion of new
  features and large changes. Reference the zagg report and benchmark numbers in the issue.
- Keep the diff small (the description above is ~150 lines of code + tests). The contributing
  guide notes that large AI-assisted PRs may be closed for reviewability.
Out of scope (explicitly deferred)
- Higher default async.concurrency (3× dense-read win in the report). Separate PR.
- New Array.fast_read_sparse() public method. Trivially built on top of the flag if
  needed later; not committed in this PR.
- Default-on behavior. Opt-in for now; revisit after real-world feedback and store-cost
  data.
- dask.array.from_zarr() graph pruning (report §6.4): downstream in dask, not zarr-python.
- xdggs MOC-from-populated-chunks (report §6.2): downstream in xdggs.