feat(geotiff)!: exponential read-ahead cache for headers#509

Closed
kylebarron wants to merge 15 commits into main from kyle/geotiff-header-multiplier

Conversation

@kylebarron
Member

@kylebarron kylebarron commented May 6, 2026

This PR is being deprecated and closed in favor of #529, which should require far fewer total metadata bytes than this one. In particular, #529 relaxes our existing requirement that we pre-fetch the entire TileByteCounts and TileOffsets arrays.


Summary

  • Replaces the fixed [SourceChunk, SourceCache] pipeline on GeoTIFF.fromUrl with a single sequential read-ahead cache that grows fetch sizes by a configurable multiplier — port of async-tiff's ReadaheadMetadataCache to TypeScript.
  • New internal mutex() helper (packages/geotiff/src/concurrency.ts) for serializing async tasks across awaits.
  • New SequentialBlockCache + SourceReadaheadCache (packages/geotiff/src/readahead-cache.ts); the latter is a chunkd SourceMiddleware.
  • Breaking: GeoTIFF.fromUrl drops chunkSize and cacheSize options. prefetch is kept (now meaning "initial fetch size") and a new multiplier option is added (defaults: 32 KiB / 2.0).
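
As a rough illustration of the `mutex()` helper listed above, here is a minimal sketch of a promise-chain mutex. The `Task` and `RunExclusive` types and the implementation itself are assumptions made for illustration; the actual helper in `packages/geotiff/src/concurrency.ts` may be shaped differently.

```typescript
// Hypothetical sketch, not the PR's actual concurrency.ts implementation.
type Task<T> = () => Promise<T>;
type RunExclusive = <T>(task: Task<T>) => Promise<T>;

function mutex(): RunExclusive {
  // Settles when the most recently queued task has settled; never rejects.
  let tail: Promise<void> = Promise.resolve();

  return function run<T>(task: Task<T>): Promise<T> {
    // Start this task only after every previously queued task has settled.
    const result = tail.then(task);
    // Swallow the outcome on the queue so one failing task does not block
    // or poison the tasks queued after it (error isolation).
    tail = result.then(
      () => undefined,
      () => undefined,
    );
    // The caller still sees this task's own value or rejection.
    return result;
  };
}
```

Tasks passed to the returned function run one at a time in call order, even when each task awaits internally, which is the property the test plan describes as serialization, ordering, and error isolation.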

Closes #500

Spec: dev-docs/specs/2026-05-05-geotiff-readahead-cache-design.md

Test Plan

  • mutex() unit tests — serialization, ordering, error isolation
  • SequentialBlockCache unit tests — single-block, cross-block, empty-buffer, EOF
  • SourceReadaheadCache unit tests — port of upstream's test_readahead_cache, plus bypasses (negative offset, full-file), concurrency serialization, file-size clamping, EOF break
  • Integration smoke test against uint8_rgb_deflate_block64_cog fixture — verifies the middleware integrates with the cogeotiff Tiff reader and triggers cache extension on small initial sizes
  • pnpm --filter @developmentseed/geotiff typecheck clean
  • All 86 tests in the package pass (the pre-existing integration-rasterio.test.ts failures are missing-fixture issues unrelated to this branch)

🤖 Generated with Claude Code

kylebarron and others added 6 commits May 5, 2026 12:36
Spec for replacing the fixed [SourceChunk, SourceCache] header pipeline
with a sequential read-ahead cache that grows fetch sizes by a configurable
multiplier. Ports async-tiff's ReadaheadMetadataCache to TypeScript.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wraps a chunkd Source with a sequential read-ahead cache from offset 0.
Initial fetches start at `initial` bytes and grow by `multiplier` on
each subsequent underlying fetch. Bypasses negative-offset and full-file
reads. Uses mutex() to serialize concurrent cache extension.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
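
The growth behavior described in this commit message can be modeled with a small function. This is a simplified sketch, not the PR's code: `fetchSchedule` and its parameters are hypothetical names, and it assumes every underlying fetch is exactly the current fetch size (clamped to the file size), whereas the real cache may size a fetch to cover a larger request.

```typescript
// Hypothetical model of the growth schedule; names are illustrative only.
function fetchSchedule(
  target: number, // bytes of leading metadata the reader will need
  initial: number, // size of the first underlying fetch
  multiplier: number, // growth factor applied after each fetch
  fileSize: number = Number.POSITIVE_INFINITY,
): number[] {
  if (initial <= 0 || multiplier < 1) {
    throw new RangeError("initial must be > 0 and multiplier must be >= 1");
  }
  const sizes: number[] = [];
  let cachedLen = 0; // bytes cached so far, always a prefix from offset 0
  let nextSize = initial;
  const goal = Math.min(target, fileSize);
  while (cachedLen < goal) {
    // Clamp the fetch so it never reads past the end of the file.
    const size = Math.min(nextSize, fileSize - cachedLen);
    sizes.push(size);
    cachedLen += size;
    nextSize = Math.ceil(nextSize * multiplier);
  }
  return sizes;
}

// With a 32 KiB initial size and multiplier 2, covering 100 KiB of header
// takes three fetches of 32 KiB, 64 KiB, and 128 KiB.
const schedule = fetchSchedule(100 * 1024, 32 * 1024, 2);
```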
Drops the chunkSize and cacheSize options on GeoTIFF.fromUrl. Adds a
multiplier option. The prefetch option now means initial fetch size for
the new read-ahead cache.

Closes #500

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kylebarron and others added 3 commits May 6, 2026 11:48
After live-testing PR #509, the cache caused catastrophic over-fetching
when zooming because cogeotiff lazily reads tile-offset/bytecount entries
from the headerSource for previously-untouched IFDs. Spec now adds:
- disable() on SourceReadaheadCache, called after open + prefetchTags
- Lazy per-IFD bulk prefetch on first Overview.fetchTile
- Default prefetch bumped 32 KiB -> 64 KiB

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…etch to 64 KiB

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
kylebarron and others added 3 commits May 12, 2026 11:38
…on first tile request

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ultiplier

# Conflicts:
#	packages/geotiff/src/geotiff.ts
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread packages/geotiff/src/source/readahead-cache.ts
Group SourceReadaheadCache and its mutex() helper under a source/
subdirectory to make the chunkd Source middleware layer explicit. No
behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +134 to +142
* Intended to be called once `GeoTIFF.fromUrl` has finished its open-phase
* reads (`Tiff.create` + `prefetchTags(primaryImage)`). At that point the
* sequential read-ahead cache has done its job; subsequent reads from
* cogeotiff are at arbitrary offsets (lazy IFD lookups, GDAL ghost-header
* probes) and do not benefit from sequential-from-zero growth — in fact
* they cause catastrophic over-fetching as the cache grows exponentially
* to encompass each new far-offset request.
*
* Idempotent. One-way: there is no `enable()`.
Member Author

I thought that we had separate sources for the header reads and the tile reads?

I think perhaps we need a more robust way to decide whether a TIFF tag should be chunked with the start of the file or not. If there are other tags defined at the end of the file, and we hadn't called disable yet, then we'd still do a huge over-fetch, right?

Member Author

@kylebarron kylebarron May 12, 2026

By Claude:

Good catch — yes, this was a real risk. Addressed in 631f5b7 with a maxExtension cap on the cache (default 4 MiB). Now, even during the open phase (before freeze() is called), if satisfying a request would require pulling more than the cap into the cache, the middleware bypasses for that request and serves it with one direct fetch instead. So a GDAL ghost-header probe at EOF, or any other far-offset tag value, no longer triggers exponential growth.

You're right that header and tile reads do use separate sources — dataSource (raw) for tile data via fetchTile(), headerSource (cached) for metadata. The hole was that cogeotiff's lazy per-tile offset/bytecount lookups also go through headerSource, which is why we needed both the cap and freeze() (now also via the lazy-prefetch in Overview.ensureTagsLoaded).

Comment thread packages/geotiff/src/geotiff.ts Outdated
Comment thread packages/geotiff/src/geotiff.ts Outdated
Comment on lines +305 to +308
// Open phase complete: scope the cache to the open phase only. From here
// on, all reads (lazy overview tag lookups, GDAL ghost-header probes,
// etc.) bypass the cache and go straight to raw HTTP.
readahead.disable();
Member Author

Perhaps we want a middle ground: check whether the existing cache already covers the new range, and otherwise fetch from the end of the file separately when requested.

Member Author

@kylebarron kylebarron May 12, 2026

By Claude:

Done in 631f5b7 — renamed disable() to freeze() and changed its semantics to exactly that: cache hits are still served from memory, only misses bypass to the underlying source. A new maxExtension option (default 4 MiB) handles your other concern about far-offset reads — single requests that would require extending the cache by more than the cap bypass instead of triggering exponential growth, even while still active during the open phase.
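
To make the `freeze()` semantics described above concrete, here is a minimal sketch of a prefix cache that serves hits from memory and bypasses on misses once frozen. Everything in it is an assumption for illustration: `ReadRange`, `PrefixCache`, and `createPrefixCache` are hypothetical names, the source is synchronous for brevity (the real chunkd Source is asynchronous), and the `maxExtension` cap and short reads at EOF are left out.

```typescript
// Hypothetical stand-in for a range source. The real chunkd Source is
// asynchronous; this sketch is synchronous to keep the control flow visible.
type ReadRange = (offset: number, length: number) => Uint8Array;

interface PrefixCache {
  read(offset: number, length: number): Uint8Array;
  freeze(): void;
}

function createPrefixCache(source: ReadRange): PrefixCache {
  let cached: Uint8Array = new Uint8Array(0); // contiguous bytes from offset 0
  let frozen = false;

  return {
    freeze(): void {
      frozen = true; // idempotent and one-way: there is no unfreeze
    },
    read(offset: number, length: number): Uint8Array {
      const end = offset + length;
      if (offset >= 0 && end <= cached.length) {
        // Hit: served from memory whether or not the cache is frozen.
        return cached.subarray(offset, end);
      }
      if (frozen || offset < 0) {
        // Miss while frozen (or a from-end read): go straight to the source
        // and leave the cached prefix untouched.
        return source(offset, length);
      }
      // Miss while active: extend the prefix from its current end to `end`.
      const extra = source(cached.length, end - cached.length);
      const grown = new Uint8Array(cached.length + extra.length);
      grown.set(cached, 0);
      grown.set(extra, cached.length);
      cached = grown;
      return cached.subarray(offset, end);
    },
  };
}
```

The difference from the earlier `disable()` behavior is the first branch: a frozen cache still answers requests it already covers, so only true misses cost an extra request.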

Comment on lines +129 to +146
/**
* Bulk-load `TileOffsets` and `TileByteCounts` for this overview's data IFD
* (and mask IFD, if present) on first call. Subsequent calls return the
* same memoized promise — no additional underlying fetches.
*
* Why this exists: cogeotiff/core lazily reads individual entries from the
* tile-offset/bytecount arrays via the header source, one 4–8 byte entry
* per tile request. For overviews not pre-loaded by `prefetchTags` (i.e.
* everything except the primary image), this means many tiny per-tile
* range requests on every tile fetch. Calling
* `image.fetch(TiffTag.TileOffsets)` once forces cogeotiff to bulk-load
* the full array; thereafter all per-tile lookups are served from memory.
*
* The bulk fetch goes through the source originally passed to
* `Tiff.create` — for {@link GeoTIFF.fromUrl}, that's the wrapped header
* source. After {@link GeoTIFF.fromUrl} disables its read-ahead cache,
* the wrapper is a pass-through, so this read hits raw HTTP directly.
*/
Member Author

I'm not sure we always want to fetch the entire tile offsets array for any input...

Member Author

@kylebarron kylebarron May 12, 2026

By Claude:

Fair concern. The trade-off: the alternative is per-entry lookups (each tile request triggers a 4–8 byte HTTP range for the offset and another for the bytecount), which scales linearly with the number of tiles fetched. For typical tile rendering — where many tiles from one overview are fetched in quick succession — the bulk fetch is one HTTP request and serves all subsequent lookups from memory.

Array sizes are bounded: a 50K×50K image with 256×256 tiles has ~40K tiles → ~320 KB array; even a 100K×100K image is ~1.2 MB. We do this lazily per-overview, so users only pay for overviews they actually render from.

If we want to make this opt-out (e.g. for very specific use cases like a single targeted tile fetch), happy to add a flag. Want me to file a follow-up issue, or do you have a specific case in mind where the bulk fetch is the wrong call?
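
The array-size estimates above can be reproduced with a few lines of arithmetic. This is a sketch with a hypothetical helper name, `tileArrayBytes`; it assumes square tiles, one array entry per tile (a single plane), and 8-byte entries. Files with 4-byte entries would need half as much, and the estimate applies to each of the two arrays separately.

```typescript
// Hypothetical helper: bytes needed for one per-tile array (offsets or
// byte counts) of a single image plane.
function tileArrayBytes(
  width: number,
  height: number,
  tileSize: number,
  bytesPerEntry: number = 8,
): number {
  // Partial tiles at the right and bottom edges still need an entry.
  const tilesAcross = Math.ceil(width / tileSize);
  const tilesDown = Math.ceil(height / tileSize);
  return tilesAcross * tilesDown * bytesPerEntry;
}

// 50K x 50K with 256 x 256 tiles: 196 * 196 = 38,416 tiles, 307,328 bytes.
const medium = tileArrayBytes(50_000, 50_000, 256);
// 100K x 100K with 256 x 256 tiles: 391 * 391 = 152,881 tiles, 1,223,048 bytes.
const large = tileArrayBytes(100_000, 100_000, 256);
```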

…efault multiplier 4

Addresses PR review feedback on #509:

- disable() renamed to freeze(): after the open phase finishes, the cache
  still serves hits from memory but stops extending. Misses bypass to the
  underlying source directly. Previously disable() bypassed entirely, which
  forced an extra HTTP request even when the data was already cached.

- New maxExtension option (default 4 MiB): caps how much the cache can
  grow in a single underlying fetch. If satisfying a request would require
  pulling more than the cap, the middleware bypasses for that request
  instead. Bounds the worst case when cogeotiff reads at a far offset
  during the open phase (e.g. GDAL ghost-header probes at EOF).

- Default multiplier bumped 2 -> 4: faster cache expansion during open
  means fewer round trips for files with larger metadata regions. The
  maxExtension cap keeps the worst case bounded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment on lines +95 to +103
/**
* Maximum bytes a single underlying fetch may add to the cache. If
* satisfying a request would require extending the cache by more than this,
* the middleware bypasses the cache entirely for that request and serves it
* with one direct fetch instead. Bounds the worst case when cogeotiff reads
* at a far offset during the open phase (e.g. GDAL ghost-header probes at
* EOF). Defaults to {@link DEFAULT_MAX_EXTENSION}.
*/
maxExtension?: number;
Member Author

Hmm, realistically how big is the largest header metadata we'll have?

Member Author

This 200 GB COG has a 61 MB header, so I'm not sure 4 MB is enough for a per-fetch cache extension:

https://vtopendata-prd.s3.amazonaws.com/Imagery/STATEWIDE_2025_30cm_LeafON_3Band.tif

* cache.len = 2 MB triggers a 4 MB underlying fetch). With it disabled,
* they go straight to raw HTTP and the cache stops mattering.
* The cache is **frozen at the end of the open phase**. Once `Tiff.create`
* and `prefetchTags(primaryImage)` finish, {@link SourceReadaheadCache.freeze}
Member Author

Oh, but I see we do always prefetch tile offsets and byte counts for the primary image already...

The previous maxExtension cap (4 MiB) capped the total per-fetch size,
which includes nextFetchSize × multiplier. With multiplier=4, once
cache.len exceeded ~1 MiB the exponential growth alone exceeded the cap,
stalling the readahead cache even for fully sequential reads. This made
the cache useless for files with large headers — e.g. a 200 GB COG with
a 61 MiB header would need ~15 individual fetches instead of ~5
exponentially-growing ones.

Replace with a maxGap cap: bypass only when a request *starts* more than
maxGap bytes past cache.len. Sequential extension is unbounded; only
far-offset probes (e.g. GDAL ghost-header reads near EOF) bypass.
Default 128 MiB — larger than any realistic TIFF metadata region, small
relative to large-COG file sizes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
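
The `maxGap` rule from this commit message comes down to a single comparison on where a request starts relative to the end of the cached prefix. The sketch below is an assumption about its shape, not the PR's code: `shouldBypass` and `DEFAULT_MAX_GAP` are hypothetical names, and treating negative offsets as bypasses follows the earlier commit message rather than this one.

```typescript
// 128 MiB, the default quoted in the commit message above.
const DEFAULT_MAX_GAP = 128 * 1024 * 1024;

// Hypothetical helper: decides whether a cache miss should skip the
// read-ahead cache instead of extending it.
function shouldBypass(
  offset: number, // start of the requested range
  cachedLen: number, // current length of the cached prefix
  maxGap: number = DEFAULT_MAX_GAP,
): boolean {
  if (offset < 0) return true; // from-end reads never extend the cache
  // Only the starting gap matters; the size of a sequential extension
  // itself is not capped.
  return offset - cachedLen > maxGap;
}

// A read that starts right where the cache ends extends the cache.
const sequential = shouldBypass(64 * 1024, 64 * 1024);
// A probe near the end of a 200 GB file starts far past the cache.
const farProbe = shouldBypass(200e9 - 16 * 1024, 64 * 1024);
```

Compared with the earlier `maxExtension` cap, this check never looks at how large the resulting fetch would be, so exponential growth on sequential reads is not cut short.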
Development

Successfully merging this pull request may close these issues.

geotiff: multiplier for HTTP range sizes for header metadata

1 participant