From 6af5702977cefd8540b0904b88a1dff19b99a3c2 Mon Sep 17 00:00:00 2001
From: Li Yazhou
Date: Thu, 1 Jan 2026 19:34:07 +0800
Subject: [PATCH 1/9] first draft of rfc

---
 core/core/src/docs/rfcs/7127_foyer_chunked.md | 77 +++++++++++++++++++
 1 file changed, 77 insertions(+)
 create mode 100644 core/core/src/docs/rfcs/7127_foyer_chunked.md

diff --git a/core/core/src/docs/rfcs/7127_foyer_chunked.md b/core/core/src/docs/rfcs/7127_foyer_chunked.md
new file mode 100644
index 000000000000..14d0062cd867
--- /dev/null
+++ b/core/core/src/docs/rfcs/7127_foyer_chunked.md
@@ -0,0 +1,77 @@
+- Proposal Name: `foyer_chunked`
+- Start Date: 2026-01-01
+- RFC PR: [apache/opendal#7127](https://github.com/apache/opendal/pull/7127)
+- Tracking Issue: [apache/opendal#6372](https://github.com/apache/opendal/issues/6372)
+
+## Summary
+
+Introduce chunked-cache support for `FoyerLayer`.
+
+## Motivation
+
+In https://github.com/apache/opendal/pull/6366, we introduced the first version of `FoyerLayer`, which allows users to add hybrid caching capability (memory + disk) out of the box.
+
+The current implementation caches the entire object as a single unit in the hybrid cache. While this approach works well for small objects and is straightforward to implement correctly, it faces limitations when dealing with large objects:
+
+- **Memory pressure**: Caching large objects (e.g., multi-GB files) as a whole can quickly exhaust the in-memory cache, causing frequent evictions and poor cache hit rates.
+- **Bandwidth waste**: Reading a small portion of a large cached object requires loading the entire object from disk cache into memory, wasting I/O bandwidth.
+- **Cache efficiency**: Large objects have lower reuse probability compared to frequently accessed smaller chunks within them.
+
+Many real-world workloads exhibit partial read patterns - reading specific ranges of large files rather than entire files.
For example, querying Parquet files often only reads specific column chunks, and video streaming typically accesses sequential segments. + +To address these issues, we propose introducing chunked cache support to FoyerLayer. By splitting large objects into fixed-size chunks and caching them independently, we can achieve better cache utilization and support efficient range reads. + +## Guide-level explanation + +Chunked cache mode allows `FoyerLayer` to cache large objects more efficiently by splitting them into fixed-size chunks instead of caching entire objects. + +### When to use chunked cache + +Use chunked cache when: +- You work with large objects (hundreds of MBs to GBs) that are rarely read entirely +- Your workload performs frequent range reads or partial reads +- You want to maximize cache hit rates for large files with limited cache capacity + +Continue using whole-object cache mode when: +- Most objects are small (< 10 MB) +- Objects are typically read in full +- You prioritize simplicity over fine-grained cache control + +### How it works from a user perspective + +When chunked cache is enabled: + +1. **Configuration**: Set the chunk size (e.g., 64 MB) when creating the `FoyerLayer`. This determines the granularity of caching. + +2. **Transparent operation**: Reading works exactly as before from the API perspective. Call `read()` or `read_with()` with any range. + +3. **Efficient caching**: When you read a range (e.g., bytes 100 MB - 150 MB from a file with 64 MB chunk size), the system: + - Calculates which chunks overlap with the requested range (chunk 1: 64-128 MB, chunk 2: 128-192 MB) + - Fetches and caches these complete chunks in the hybrid cache + - Returns only the exact range you requested (100-150 MB) by slicing the cached chunks + - Future reads to bytes 64-100 MB or 150-192 MB can be served entirely from cache + - Each chunk is cached and evicted independently based on the hybrid cache's LRU policy + +4. 
**Example**: + ```rust + // Enable chunked cache with 64 MB chunks + let layer = FoyerLayer::new() + .with_chunk_size_bytes(64 * 1024 * 1024); // 64 MB chunks + + let op = Operator::new(S3::default())? + .layer(layer) + .finish(); + + // Read a range from a 1 GB file + let data = op.read_with("large_file.bin") + .range(100_000_000..150_000_000) // Read 50 MB starting at 100 MB + .await?; + + // Only the chunks overlapping this range (chunk 1 and chunk 2) are cached + // Next time you read bytes 120-180 MB, chunk 2 is already in cache + ``` + +### Trade-offs + +- **Pros**: Better cache utilization for large objects, supports efficient partial reads, reduced memory pressure +- **Cons**: Slightly more complex cache management, small overhead for tracking chunks & metadata. \ No newline at end of file From 6cd9ccb2bbf64e6902a889b8342216d402544511 Mon Sep 17 00:00:00 2001 From: Li Yazhou Date: Thu, 1 Jan 2026 20:40:41 +0800 Subject: [PATCH 2/9] vibe the content of doc --- core/core/src/docs/rfcs/7127_foyer_chunked.md | 401 +++++++++++++++++- 1 file changed, 400 insertions(+), 1 deletion(-) diff --git a/core/core/src/docs/rfcs/7127_foyer_chunked.md b/core/core/src/docs/rfcs/7127_foyer_chunked.md index 14d0062cd867..67173f8b1080 100644 --- a/core/core/src/docs/rfcs/7127_foyer_chunked.md +++ b/core/core/src/docs/rfcs/7127_foyer_chunked.md @@ -74,4 +74,403 @@ When chunked cache is enabled: ### Trade-offs - **Pros**: Better cache utilization for large objects, supports efficient partial reads, reduced memory pressure -- **Cons**: Slightly more complex cache management, small overhead for tracking chunks & metadata. \ No newline at end of file +- **Cons**: Slightly more complex cache management, small overhead for tracking chunks & metadata. + +## Reference-level explanation + +This section describes the technical implementation details of chunked cache support in `FoyerLayer`. 
The design is based on SlateDB's `CachedObjectStore` implementation, which has proven effective in production workloads.
+
+### Configuration
+
+Extend `FoyerLayer` with a new optional configuration parameter:
+
+```rust
+pub struct FoyerLayer {
+    cache: HybridCache,              // foyer hybrid cache handle (generic parameters omitted here)
+    chunk_size_bytes: Option<usize>, // None = whole-object mode (default)
+}
+
+impl FoyerLayer {
+    pub fn with_chunk_size_bytes(mut self, size: usize) -> Self {
+        // Validate chunk size is aligned with 1KB
+        assert!(size > 0 && size.is_multiple_of(1024),
+            "chunk_size_bytes must be > 0 and aligned to 1KB");
+        self.chunk_size_bytes = Some(size);
+        self
+    }
+}
+```
+
+When `chunk_size_bytes` is `None`, the layer operates in whole-object mode (current behavior). When set to a value, chunked cache mode is enabled.
+
+### Cache Key Design
+
+Chunked cache uses structured cache keys encoded with bincode for type safety and efficiency. The chunk size and object version must be included in the cache key to prevent data corruption when the chunk size changes or the object is updated.
+
+```rust
+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
+enum CacheKey {
+    /// Metadata cache entry for an object
+    Metadata {
+        path: String,
+        chunk_size: usize,
+        version: Option<String>,
+    },
+    /// Chunk data cache entry
+    Chunk {
+        path: String,
+        chunk_size: usize,
+        chunk_index: usize,
+        version: Option<String>,
+    },
+}
+```
+
+**Usage examples**:
+
+```rust
+// Metadata key for "data/large.bin" with 64MB chunks and etag
+let meta_key = CacheKey::Metadata {
+    path: "data/large.bin".to_string(),
+    chunk_size: 67108864,
+    version: Some("\"abc123def\"".to_string()), // etag from object metadata
+}.to_bytes();
+
+// Chunk key for chunk 0
+let chunk_key = CacheKey::Chunk {
+    path: "data/large.bin".to_string(),
+    chunk_size: 67108864,
+    chunk_index: 0,
+    version: Some("\"abc123def\"".to_string()),
+}.to_bytes();
+```
+
+### Metadata Structure
+
+Besides the cache key, we also need to cache the metadata of the object.
Since the object size is needed to determine the total number of chunks and to compute the boundary of the last chunk (which may be smaller than the configured chunk size), metadata must be fetched and cached before any chunk can be read.
+
+```rust
+#[derive(Serialize, Deserialize)]
+struct ObjectMetadata {
+    size: u64,
+    etag: Option<String>,
+    last_modified: Option<Timestamp>, // `Timestamp` is illustrative; the concrete time type is elided in this draft
+}
+```
+
+### Read Operation Implementation
+
+The read operation follows this flow (inspired by SlateDB's design):
+
+1. **Check chunked mode**: If `chunk_size_bytes` is `None`, fall back to whole-object caching (current implementation).
+
+2. **Prefetch with aligned range** (key optimization):
+   ```rust
+   async fn maybe_prefetch_range(
+       &self,
+       path: &str,
+       range: Option<Range<u64>>,
+   ) -> Result<ObjectMetadata> {
+       let chunk_size = self.chunk_size_bytes.unwrap();
+
+       // First, try to get cached metadata to obtain version
+       // We try with version=None first since we don't know the version yet
+       let meta_key_without_version = CacheKey::Metadata {
+           path: path.to_string(),
+           chunk_size,
+           version: None,
+       }.to_bytes();
+
+       if let Some(cached_meta) = self.cache.get(&meta_key_without_version).await {
+           return deserialize_metadata(cached_meta);
+       }
+
+       // Align the range to chunk boundaries for efficient prefetching
+       let aligned_range = range.map(|r| self.align_range(&r));
+
+       // Fetch from underlying storage with aligned range
+       // This fetches MORE data than requested, but aligned to chunk boundaries
+       // Example: User requests 100-150MB, we fetch 64-192MB (chunks 1,2)
+       let (rp, mut reader) = self.inner.read(path, aligned_range.clone()).await?;
+
+       // Extract version (etag) from response
+       let version = rp.metadata().etag().map(String::from);
+
+       // Save metadata with version
+       let metadata = ObjectMetadata {
+           size: rp.metadata().content_length(),
+           etag: version.clone(),
+           last_modified: rp.metadata().last_modified(),
+       };
+
+       let meta_key = CacheKey::Metadata {
+           path: path.to_string(),
+           chunk_size,
+           version: version.clone(),
+       }.to_bytes();
+
+       self.cache.insert(meta_key, serialize_metadata(&metadata)?).await;
+       // NOTE: the lookup above used `version: None`, so the metadata must also be
+       // stored under the unversioned key for the next read to find it; the
+       // versioned keys protect chunk data against object updates.
+       self.cache.insert(meta_key_without_version, serialize_metadata(&metadata)?).await;
+
+       // Stream data and save chunks (with version)
+       let start = aligned_range.as_ref().map_or(0, |r| r.start);
+       self.save_chunks_from_stream(path, reader, start, version).await?;
+
+
Ok(metadata)
+   }
+
+   fn align_range(&self, range: &Range<u64>) -> Range<u64> {
+       let chunk_size = self.chunk_size_bytes.unwrap() as u64;
+       let start_aligned = range.start - (range.start % chunk_size);
+       let end_aligned = range.end.div_ceil(chunk_size) * chunk_size;
+       start_aligned..end_aligned
+   }
+   ```
+
+   **Why alignment matters**: When the object is not yet cached, aligning the range allows us to fetch complete chunks in a single request. For example, if the user requests bytes 100MB-150MB with 64MB chunks, we fetch 64MB-192MB in one request and save chunks 1 and 2. Future reads to any part of chunks 1 or 2 will hit the cache.
+
+   **Version handling**: The version (etag) is obtained from the read response and included in all cache keys. This ensures that when an object is updated (and its etag changes), old cached chunks won't be used.
+
+3. **Split range into chunks**:
+   ```rust
+   fn split_range_into_chunks(
+       &self,
+       range: Range<u64>,
+       object_size: u64,
+   ) -> Vec<(usize, Range<usize>)> { // (chunk index, byte range within that chunk)
+       let chunk_size = self.chunk_size_bytes.unwrap() as u64;
+       let range_aligned = self.align_range(&range);
+
+       let start_chunk = (range_aligned.start / chunk_size) as usize;
+       let end_chunk = (range_aligned.end / chunk_size) as usize;
+
+       let mut chunks: Vec<_> = (start_chunk..end_chunk)
+           .map(|chunk_idx| (chunk_idx, 0..self.chunk_size_bytes.unwrap()))
+           .collect();
+
+       // Adjust first chunk's start offset
+       if let Some(first) = chunks.first_mut() {
+           first.1.start = (range.start % chunk_size) as usize;
+       }
+
+       // Adjust last chunk's end offset (handle unaligned end)
+       if let Some(last) = chunks.last_mut() {
+           if range.end % chunk_size != 0 {
+               last.1.end = (range.end % chunk_size) as usize;
+           }
+           // Handle last chunk of object being smaller than chunk_size
+           let chunk_global_end = ((last.0 + 1) as u64 * chunk_size).min(object_size);
+           let actual_chunk_size = (chunk_global_end - last.0 as u64 * chunk_size) as usize;
+           last.1.end = last.1.end.min(actual_chunk_size);
+       }
+
+       chunks
+   }
+   ```
+
+4. 
**Read chunks individually**:
+   ```rust
+   // `Bytes` stands for the cached chunk buffer type
+   async fn read_chunk(
+       &self,
+       path: &str,
+       chunk_idx: usize,
+       range_in_chunk: Range<usize>,
+       version: Option<String>, // Object version from metadata
+   ) -> Result<Bytes> {
+       let chunk_size = self.chunk_size_bytes.unwrap();
+       let chunk_key = CacheKey::Chunk {
+           path: path.to_string(),
+           chunk_size,
+           chunk_index: chunk_idx,
+           version: version.clone(),
+       }.to_bytes();
+
+       // Try cache first
+       if let Some(cached_chunk) = self.cache.get(&chunk_key).await {
+           return Ok(cached_chunk.slice(range_in_chunk));
+       }
+
+       // Cache miss - fetch entire chunk from underlying storage
+       // (for the last chunk, the underlying storage clamps the range to the object size)
+       let chunk_range = Range {
+           start: chunk_idx as u64 * chunk_size as u64,
+           end: (chunk_idx + 1) as u64 * chunk_size as u64,
+       };
+
+       let (_, mut reader) = self.inner.read(path, Some(chunk_range)).await?;
+       let chunk_data = reader.read_all().await?;
+
+       // Save to cache (best-effort, ignore errors)
+       self.cache.insert(chunk_key, chunk_data.clone()).await.ok();
+
+       Ok(chunk_data.slice(range_in_chunk))
+   }
+   ```
+
+5. 
**Assemble result as stream**:
+   - Return chunks as a `Stream<Item = Result<Bytes>>` rather than buffering all data
+   - Each chunk is fetched lazily when the stream is polled
+   - This reduces memory pressure and allows streaming large ranges efficiently
+
+### Write Operation Implementation
+
+Following SlateDB's pattern, write operations can optionally cache the written data:
+
+```rust
+async fn write(&self, path: &str, args: OpWrite) -> Result<RpWrite> {
+    // Write to underlying storage first
+    let result = self.inner.write(path, args.clone()).await?;
+
+    // Optionally cache the written data (can be controlled by a flag)
+    if self.cache_writes {
+        // Fetch metadata via stat
+        if let Ok(meta) = self.inner.stat(path).await {
+            let metadata = ObjectMetadata::from(meta);
+            // NOTE: simplified string key for illustration; the real implementation
+            // would build a `CacheKey::Metadata` as described above
+            let meta_key = format!("{}#meta", path);
+            self.cache.insert(meta_key, serialize_metadata(&metadata)?).await;
+
+            // Stream the written data into chunks
+            // Note: This requires buffering the write payload, which may not be desirable
+            // For now, we can skip caching write data and only cache on subsequent reads
+        }
+    }
+
+    Ok(result)
+}
+```
+
+**Write caching strategy**:
+- **Simple approach**: Only invalidate metadata, don't cache write data
+  - Remove `{path}#meta` from cache
+  - Let chunks naturally expire via LRU
+  - Subsequent reads will populate cache
+- **Aggressive approach**: Cache written data if enabled
+  - Useful for write-then-read patterns
+  - Requires access to write payload (may need buffering)
+  - Can be controlled via `cache_writes` flag (similar to SlateDB)
+
+### Delete Operation Implementation
+
+When a delete completes:
+
+```rust
+async fn delete(&self, path: &str) -> Result<RpDelete> {
+    let result = self.inner.delete(path).await?;
+
+    // Best-effort cache invalidation
+    // Remove metadata (chunks will be evicted naturally)
+    let meta_key = format!("{}#meta", path);
+    self.cache.remove(&meta_key).await.ok();
+
+    // Optionally: If metadata is in cache, calculate and remove all chunks
+    // This is more thorough
but requires additional cache lookup + + Ok(result) +} +``` + +**Rationale**: Lazy chunk removal is acceptable because: +- Cached chunks for deleted objects are harmless (worst case: wasted cache space) +- They'll be evicted naturally by LRU when cache pressure increases +- Scanning cache for all chunks is expensive and not worth the cost + +### Key Design Decisions + +**Range alignment strategy**: +- When metadata is not cached, align the requested range to chunk boundaries before fetching +- Example: Request 100-150MB with 64MB chunks → fetch aligned 64-192MB +- **Trade-off**: Fetches more data initially, but populates cache more efficiently +- **Benefit**: Reduces number of requests to underlying storage (one aligned request vs. multiple chunk requests) +- Only apply alignment on first fetch (cache miss); subsequent reads use cached chunks + +**Streaming instead of buffering**: +- Return data as a stream rather than loading all chunks into memory +- Each chunk is fetched lazily when consumed +- Matches OpenDAL's streaming API design +- Critical for memory efficiency when reading large ranges + +**Chunk size validation**: +- Require chunk size to be aligned to 1KB (similar to SlateDB) +- Prevents edge cases with very small or misaligned chunks +- Recommended range: 16MB - 128MB + +**Cache operation error handling**: +- All cache operations (insert, remove, get) should be best-effort +- Cache failures should NOT fail the user's read/write operation +- Log warnings for cache errors but continue with fallback to underlying storage +- This ensures cache is truly transparent to users + +### Edge Cases and Considerations + +**Last chunk handling**: +- The last chunk may be smaller than `chunk_size_bytes` +- Calculate actual chunk size: `min((chunk_idx + 1) * chunk_size, object_size) - chunk_idx * chunk_size` +- Example: 200 MB file with 64 MB chunks → chunks 0, 1, 2 (64MB each), chunk 3 (8MB) +- Already handled in `split_range_into_chunks` logic above + +**Empty or 
invalid range requests**: +- Empty range: Return empty result without cache operations +- Start beyond object size: Return error (per OpenDAL semantics) +- End beyond object size: Clamp end to object size + +**Concurrent access**: +- Foyer's built-in request deduplication handles concurrent reads to the same chunk +- Multiple concurrent reads to chunk N will result in only one fetch from underlying storage +- Other readers wait and reuse the result +- No additional locking needed in FoyerLayer + +**Cache consistency**: +- Cache follows eventual consistency model (same as OpenDAL) +- No distributed coordination for concurrent writes from different processes +- Cache invalidation on write/delete is best-effort +- Acceptable for object storage workloads (most are read-heavy, immutable objects) + +### Performance Characteristics + +**Benefits of aligned prefetching**: +- **Fewer requests**: One aligned request instead of N chunk requests on cache miss + - Example: Request 100-150MB → 1 aligned fetch (64-192MB) vs. 2 separate chunk fetches +- **Better locality**: Neighboring chunks are likely to be accessed together +- **Reduced latency**: Fewer round-trips to underlying storage + +**Memory efficiency**: +- Metadata overhead: ~100-200 bytes per object +- Chunk data follows normal LRU eviction +- Streaming API avoids buffering large ranges in memory +- Each chunk is independently evictable + +**Cache hit rate analysis**: +- **Partial reads**: Significantly improved hit rate + - Chunks are smaller units, higher reuse probability + - Example: Reading different columns of a Parquet file reuses row group chunks +- **Whole-object reads**: Slightly lower hit rate due to fragmentation + - Requires all chunks to be cached vs. 
one whole-object entry + - Trade-off is acceptable given target workload (partial reads) + +### Testing Strategy + +**Unit tests**: +- `split_range_into_chunks` with various ranges and object sizes +- `align_range` edge cases (aligned, unaligned, boundary conditions) +- Last chunk handling (smaller than chunk_size) +- Empty and invalid ranges + +**Integration tests**: +- End-to-end read with cache hit and miss +- Concurrent reads to same chunk (verify deduplication) +- Write invalidation behavior +- Mixed whole-object and chunked reads + +**Behavior tests**: +- Use existing OpenDAL behavior test suite +- Add chunked cache specific scenarios: + - Large file with range reads + - Sequential read patterns + - Random access patterns + +### Compatibility and Migration + +- **Backward compatible**: Defaults to `chunk_size_bytes = None` (whole-object mode) +- **No breaking changes**: Existing users unaffected +- **Opt-in**: Users explicitly enable chunked mode via configuration +- **Cache format change**: Whole-object cache and chunked cache use different key formats + - No automatic migration needed (cache rebuilds naturally) + - Changing chunk size also invalidates cache (keys change) + - This is acceptable since cache is ephemeral \ No newline at end of file From 195a566352d66c9f6baf3a3f1610b3d9ecff423a Mon Sep 17 00:00:00 2001 From: Li Yazhou Date: Thu, 1 Jan 2026 21:47:49 +0800 Subject: [PATCH 3/9] make the impl paragraphs more readable --- core/core/src/docs/rfcs/7127_foyer_chunked.md | 227 ++++++------------ 1 file changed, 76 insertions(+), 151 deletions(-) diff --git a/core/core/src/docs/rfcs/7127_foyer_chunked.md b/core/core/src/docs/rfcs/7127_foyer_chunked.md index 67173f8b1080..a5e0375cf5e9 100644 --- a/core/core/src/docs/rfcs/7127_foyer_chunked.md +++ b/core/core/src/docs/rfcs/7127_foyer_chunked.md @@ -158,13 +158,13 @@ struct ObjectMetadata { } ``` -### Read Operation Implementation +### Implementation -The read operation follows this flow (inspired by 
SlateDB's design): +The read operation follows this flow: 1. **Check chunked mode**: If `chunk_size_bytes` is `None`, fallback to whole-object caching (current implementation). -2. **Prefetch with aligned range** (key optimization): +2. **Prefetch with aligned range**: ```rust async fn maybe_prefetch_range( &self, @@ -227,8 +227,6 @@ The read operation follows this flow (inspired by SlateDB's design): **Why alignment matters**: When object is not yet cached, aligning the range allows us to fetch complete chunks in a single request. For example, if user requests bytes 100MB-150MB with 64MB chunks, we fetch 64MB-192MB in one request and save chunks 1 and 2. Future reads to any part of chunks 1 or 2 will hit cache. - **Version handling**: The version (etag) is obtained from the read response and included in all cache keys. This ensures that when an object is updated (etag changes), old cached chunks won't be used. - 3. **Split range into chunks**: ```rust fn split_range_into_chunks( @@ -300,6 +298,7 @@ The read operation follows this flow (inspired by SlateDB's design): // Save to cache (best-effort, ignore errors) self.cache.insert(chunk_key, chunk_data.clone()).await.ok(); + // Return the requested range Ok(chunk_data.slice(range_in_chunk)) } ``` @@ -309,168 +308,94 @@ The read operation follows this flow (inspired by SlateDB's design): - Each chunk is fetched lazily when the stream is polled - This reduces memory pressure and allows streaming large ranges efficiently -### Write Operation Implementation - -Following SlateDB's pattern, write operations can optionally cache the written data: +### Key Design Considerations -```rust -async fn write(&self, path: &str, args: OpWrite) -> Result { - // Write to underlying storage first - let result = self.inner.write(path, args.clone()).await?; - - // Optionally cache the written data (can be controlled by a flag) - if self.cache_writes { - // Fetch metadata via stat - if let Ok(meta) = self.inner.stat(path).await { - let 
metadata = ObjectMetadata::from(meta); - let meta_key = format!("{}#meta", path); - self.cache.insert(meta_key, serialize_metadata(&metadata)?).await; - - // Stream the written data into chunks - // Note: This requires buffering the write payload, which may not be desirable - // For now, we can skip caching write data and only cache on subsequent reads - } - } +1. **Range alignment strategy** - Ok(result) -} -``` + When metadata is not yet cached, the implementation aligns the requested range to chunk boundaries before fetching from the underlying storage. For example, if a user requests bytes 100-150MB with 64MB chunks configured, the system will fetch the aligned range of 64-192MB. -**Write caching strategy**: -- **Simple approach**: Only invalidate metadata, don't cache write data - - Remove `{path}#meta` from cache - - Let chunks naturally expire via LRU - - Subsequent reads will populate cache -- **Aggressive approach**: Cache written data if enabled - - Useful for write-then-read patterns - - Requires access to write payload (may need buffering) - - Can be controlled via `cache_writes` flag (similar to SlateDB) + While this fetches more data initially, it significantly reduces the number of requests to the underlying storage by consolidating multiple chunk fetches into a single aligned request. This trade-off proves beneficial as it populates the cache more efficiently and reduces overall latency. -### Delete Operation Implementation + The alignment is only applied on the first fetch (cache miss). Subsequent reads can directly use the cached chunks without additional alignment overhead. -When a delete completes: +2. **Streaming result** -```rust -async fn delete(&self, path: &str) -> Result { - let result = self.inner.delete(path).await?; + The implementation returns data as a stream where each chunk is fetched lazily when consumed. 
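The lazy-streaming idea above can be sketched with a synchronous iterator; the real implementation would be an async `Stream`, and `ChunkStream`/`fetch_chunk` are illustrative names rather than OpenDAL APIs:

```rust
/// Minimal sketch of lazy per-chunk delivery: a chunk is produced only when
/// the consumer pulls it, so the full range is never buffered in memory.
struct ChunkStream {
    next_idx: usize,
    end_idx: usize,
    chunk_size: usize,
}

impl Iterator for ChunkStream {
    type Item = Vec<u8>;
    fn next(&mut self) -> Option<Vec<u8>> {
        if self.next_idx >= self.end_idx {
            return None;
        }
        // Lazy: this runs only when the stream is polled.
        let chunk = fetch_chunk(self.next_idx, self.chunk_size);
        self.next_idx += 1;
        Some(chunk)
    }
}

// Stand-in for "read chunk N from cache, else from underlying storage".
fn fetch_chunk(idx: usize, chunk_size: usize) -> Vec<u8> {
    vec![idx as u8; chunk_size]
}

fn main() {
    // A request covering chunks 1 and 2 yields exactly two chunks, on demand.
    let stream = ChunkStream { next_idx: 1, end_idx: 3, chunk_size: 4 };
    let chunks: Vec<Vec<u8>> = stream.collect();
    assert_eq!(chunks.len(), 2);
    assert_eq!(chunks[0], vec![1u8; 4]);
    assert_eq!(chunks[1], vec![2u8; 4]);
    println!("streamed {} chunks lazily", chunks.len());
}
```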
- // Best-effort cache invalidation - // Remove metadata (chunks will be evicted naturally) - let meta_key = format!("{}#meta", path); - self.cache.remove(&meta_key).await.ok(); + This approach is critical for memory efficiency when reading large ranges that span many chunks. Without streaming, reading a multi-gigabyte range would require loading all chunks into memory simultaneously, potentially exhausting available memory and causing performance degradation. - // Optionally: If metadata is in cache, calculate and remove all chunks - // This is more thorough but requires additional cache lookup +3. **Best-effort cache operations** - Ok(result) -} -``` + All cache operations (insert, remove, get) are designed to never fail the user's read or write operation. -**Rationale**: Lazy chunk removal is acceptable because: -- Cached chunks for deleted objects are harmless (worst case: wasted cache space) -- They'll be evicted naturally by LRU when cache pressure increases -- Scanning cache for all chunks is expensive and not worth the cost - -### Key Design Decisions - -**Range alignment strategy**: -- When metadata is not cached, align the requested range to chunk boundaries before fetching -- Example: Request 100-150MB with 64MB chunks → fetch aligned 64-192MB -- **Trade-off**: Fetches more data initially, but populates cache more efficiently -- **Benefit**: Reduces number of requests to underlying storage (one aligned request vs. 
multiple chunk requests) -- Only apply alignment on first fetch (cache miss); subsequent reads use cached chunks - -**Streaming instead of buffering**: -- Return data as a stream rather than loading all chunks into memory -- Each chunk is fetched lazily when consumed -- Matches OpenDAL's streaming API design -- Critical for memory efficiency when reading large ranges - -**Chunk size validation**: -- Require chunk size to be aligned to 1KB (similar to SlateDB) -- Prevents edge cases with very small or misaligned chunks -- Recommended range: 16MB - 128MB - -**Cache operation error handling**: -- All cache operations (insert, remove, get) should be best-effort -- Cache failures should NOT fail the user's read/write operation -- Log warnings for cache errors but continue with fallback to underlying storage -- This ensures cache is truly transparent to users + If a cache operation encounters an error, the implementation logs a warning and continues by falling back to the underlying storage. This ensures that the cache layer remains truly transparent to users. 
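The "never fail the user" rule can be sketched as a small read-through wrapper; `read_through`, `cache_get`, and `fetch_origin` are hypothetical helper names, not the actual FoyerLayer code:

```rust
/// Best-effort read-through sketch: a cache failure is logged and ignored,
/// and the data is served from the underlying storage instead.
fn read_through(
    cache_get: impl Fn() -> Result<Option<Vec<u8>>, String>,
    fetch_origin: impl Fn() -> Result<Vec<u8>, String>,
) -> Result<Vec<u8>, String> {
    match cache_get() {
        Ok(Some(hit)) => return Ok(hit),                   // cache hit
        Ok(None) => {}                                     // clean miss
        Err(e) => eprintln!("cache error (ignored): {e}"), // degraded, not fatal
    }
    fetch_origin() // only origin errors surface to the caller
}

fn main() {
    // A broken cache must not break the read.
    let data = read_through(
        || Err("disk cache io error".to_string()),
        || Ok(b"chunk-bytes".to_vec()),
    )
    .unwrap();
    assert_eq!(data, b"chunk-bytes".to_vec());

    // A healthy cache hit short-circuits the origin fetch entirely.
    let hit = read_through(|| Ok(Some(vec![1, 2])), || unreachable!()).unwrap();
    assert_eq!(hit, vec![1, 2]);
}
```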
### Edge Cases and Considerations -**Last chunk handling**: -- The last chunk may be smaller than `chunk_size_bytes` -- Calculate actual chunk size: `min((chunk_idx + 1) * chunk_size, object_size) - chunk_idx * chunk_size` -- Example: 200 MB file with 64 MB chunks → chunks 0, 1, 2 (64MB each), chunk 3 (8MB) -- Already handled in `split_range_into_chunks` logic above - -**Empty or invalid range requests**: -- Empty range: Return empty result without cache operations -- Start beyond object size: Return error (per OpenDAL semantics) -- End beyond object size: Clamp end to object size - -**Concurrent access**: -- Foyer's built-in request deduplication handles concurrent reads to the same chunk -- Multiple concurrent reads to chunk N will result in only one fetch from underlying storage -- Other readers wait and reuse the result -- No additional locking needed in FoyerLayer - -**Cache consistency**: -- Cache follows eventual consistency model (same as OpenDAL) -- No distributed coordination for concurrent writes from different processes -- Cache invalidation on write/delete is best-effort -- Acceptable for object storage workloads (most are read-heavy, immutable objects) - -### Performance Characteristics - -**Benefits of aligned prefetching**: -- **Fewer requests**: One aligned request instead of N chunk requests on cache miss - - Example: Request 100-150MB → 1 aligned fetch (64-192MB) vs. 
2 separate chunk fetches -- **Better locality**: Neighboring chunks are likely to be accessed together -- **Reduced latency**: Fewer round-trips to underlying storage - -**Memory efficiency**: -- Metadata overhead: ~100-200 bytes per object -- Chunk data follows normal LRU eviction -- Streaming API avoids buffering large ranges in memory -- Each chunk is independently evictable - -**Cache hit rate analysis**: -- **Partial reads**: Significantly improved hit rate - - Chunks are smaller units, higher reuse probability - - Example: Reading different columns of a Parquet file reuses row group chunks -- **Whole-object reads**: Slightly lower hit rate due to fragmentation - - Requires all chunks to be cached vs. one whole-object entry - - Trade-off is acceptable given target workload (partial reads) +**Last chunk handling** + +The last chunk of an object may be smaller than the configured chunk size and requires special attention. The implementation calculates the actual chunk size using the formula `min((chunk_idx + 1) * chunk_size, object_size) - chunk_idx * chunk_size`. + +For example, a 200 MB file with 64 MB chunks would be split into chunks 0, 1, and 2 of 64MB each, followed by chunk 3 containing only 8MB. + +**Empty or invalid range requests** + +Range requests are handled according to OpenDAL's existing semantics: +- Empty range: Returns empty result without performing any cache operations +- Range start beyond object size: Returns error to match OpenDAL's behavior +- Range end exceeds object size: Clamped to the actual object size, allowing partial reads near the end of objects + +**Concurrent access** + +Concurrent access patterns benefit from Foyer's built-in request deduplication mechanism. When multiple concurrent reads request the same chunk, Foyer ensures that only one fetch actually occurs from the underlying storage, while other readers wait and reuse the result. 
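Foyer provides this deduplication internally; the effect can be illustrated with a minimal single-flight sketch built only from the standard library (illustrative, not Foyer's implementation):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

// Counts how often the (simulated) underlying storage is actually hit.
static FETCHES: AtomicUsize = AtomicUsize::new(0);

fn main() {
    // One shared slot per chunk index; `OnceLock` gives single-flight semantics:
    // the first caller runs the fetch, concurrent callers block and reuse it.
    let slots: Arc<Mutex<HashMap<usize, Arc<OnceLock<Vec<u8>>>>>> =
        Arc::new(Mutex::new(HashMap::new()));

    let handles: Vec<_> = (0..8)
        .map(|_| {
            let slots = Arc::clone(&slots);
            std::thread::spawn(move || {
                let slot = slots.lock().unwrap().entry(42).or_default().clone();
                let chunk = slot.get_or_init(|| {
                    FETCHES.fetch_add(1, Ordering::SeqCst);
                    vec![0u8; 4] // simulated fetch of chunk 42 from storage
                });
                chunk.len()
            })
        })
        .collect();

    for h in handles {
        assert_eq!(h.join().unwrap(), 4); // every reader sees the same chunk
    }
    // Eight concurrent readers, exactly one storage fetch.
    assert_eq!(FETCHES.load(Ordering::SeqCst), 1);
    println!("deduplicated: 8 readers, 1 fetch");
}
```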
+
+This deduplication happens transparently within the Foyer cache layer, requiring no additional locking or coordination logic in FoyerLayer itself.
+
+**Cache consistency**
+
+The cache follows an eventual consistency model aligned with OpenDAL's consistency guarantees. There is no distributed coordination for concurrent writes from different processes, and cache invalidation on write or delete operations is performed on a best-effort basis.
+
+This relaxed consistency model is acceptable for typical object storage workloads, which are predominantly read-heavy and often involve immutable objects.
 
 ### Testing Strategy
 
-**Unit tests**:
-- `split_range_into_chunks` with various ranges and object sizes
-- `align_range` edge cases (aligned, unaligned, boundary conditions)
-- Last chunk handling (smaller than chunk_size)
-- Empty and invalid ranges
-
-**Integration tests**:
-- End-to-end read with cache hit and miss
-- Concurrent reads to same chunk (verify deduplication)
-- Write invalidation behavior
-- Mixed whole-object and chunked reads
-
-**Behavior tests**:
-- Use existing OpenDAL behavior test suite
-- Add chunked cache specific scenarios:
-  - Large file with range reads
-  - Sequential read patterns
-  - Random access patterns
+1. **Unit tests**
+
+   Focus on the core algorithms with various test cases:
+   - `split_range_into_chunks` with different combinations of ranges and object sizes to verify correct chunk boundary calculations
+   - `align_range` with aligned ranges, unaligned ranges, and boundary conditions to ensure all edge cases are handled correctly
+   - Last chunk handling when it's smaller than chunk_size
+   - Empty and invalid range scenarios
+
+2. **Integration tests**
+
+   Validate end-to-end behavior of the chunked cache system:
+   - Cache hit and miss scenarios to ensure prefetching and caching logic works correctly
+   - Concurrent reads to the same chunk to verify Foyer's request deduplication
+   - Write invalidation behavior to confirm cached data is properly invalidated when objects are modified
+   - Mixed workloads using both whole-object mode and chunked mode
+
+3. **Behavior tests**
+
+   Leverage OpenDAL's existing behavior test suite, which provides comprehensive coverage across different backends.
+
+   Add chunked cache specific scenarios:
+   - Large files with range reads to validate performance characteristics
+   - Sequential read patterns to verify prefetching efficiency
+   - Random access patterns to ensure proper handling of non-sequential workloads
 
 ### Compatibility and Migration
 
-- **Backward compatible**: Defaults to `chunk_size_bytes = None` (whole-object mode)
-- **No breaking changes**: Existing users unaffected
-- **Opt-in**: Users explicitly enable chunked mode via configuration
-- **Cache format change**: Whole-object cache and chunked cache use different key formats
-  - No automatic migration needed (cache rebuilds naturally)
-  - Changing chunk size also invalidates cache (keys change)
-  - This is acceptable since cache is ephemeral
\ No newline at end of file
+**Backward compatibility**
+
+The chunked cache feature is fully backward compatible with existing FoyerLayer usage. The implementation defaults to `chunk_size_bytes = None`, which activates whole-object mode matching the current behavior. This means existing users are completely unaffected by the introduction of chunked caching.
+
+**Opt-in design**
+
+Chunked cache is an opt-in feature that users must explicitly enable through configuration by setting the chunk size. This conservative approach ensures that users who haven't evaluated whether chunked caching benefits their workload will continue to use the proven whole-object caching strategy.
+
+**Cache key migration**
+
+The cache key format changes between whole-object and chunked modes, but this requires no special migration handling. Since whole-object cache uses different keys than chunked cache, and different chunk sizes use different keys from each other, old cache entries simply coexist harmlessly with new ones.
+
+As the LRU eviction policy runs, old entries naturally expire and are replaced with new entries in the current format. This natural invalidation is acceptable because the cache is ephemeral by design, storing temporary performance-optimization data rather than durable state.
\ No newline at end of file

From 51522a0acbe2ea8a7766178d20fac41de915c988 Mon Sep 17 00:00:00 2001
From: Li Yazhou
Date: Thu, 1 Jan 2026 22:00:01 +0800
Subject: [PATCH 4/9] update the description about prefetching

---
 core/core/src/docs/rfcs/7127_foyer_chunked.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/core/core/src/docs/rfcs/7127_foyer_chunked.md b/core/core/src/docs/rfcs/7127_foyer_chunked.md
index a5e0375cf5e9..77523ba2c810 100644
--- a/core/core/src/docs/rfcs/7127_foyer_chunked.md
+++ b/core/core/src/docs/rfcs/7127_foyer_chunked.md
@@ -164,7 +164,12 @@ The read operation follows this flow:
 
 1. **Check chunked mode**: If `chunk_size_bytes` is `None`, fallback to whole-object caching (current implementation).
 
-2. **Prefetch with aligned range**:
+2. **Prefetch with coerced range**:
+
+   Without prefetching, chunked cache would require separate requests for each chunk on cache miss, potentially increasing the total number of requests to the underlying storage. To mitigate this, when an object is not yet cached, the implementation coerces the requested range to chunk boundaries and fetches complete chunks in a single aligned request.
+
+   For example, if a user requests bytes 100MB-150MB with 64MB chunks configured, the implementation fetches the aligned range 64MB-192MB in one request and saves chunks 1 and 2. This consolidates what would otherwise be two separate chunk requests into a single request. Future reads to any part of these cached chunks will hit the cache without additional requests.
+
     ```rust
     async fn maybe_prefetch_range(
         &self,
         path: &str,
         range: Option<Range<u64>>,
     ) -> Result<ObjectMetadata> {
@@ -225,9 +230,8 @@ The read operation follows this flow:
     }
     ```
 
-   **Why alignment matters**: When object is not yet cached, aligning the range allows us to fetch complete chunks in a single request. For example, if user requests bytes 100MB-150MB with 64MB chunks, we fetch 64MB-192MB in one request and save chunks 1 and 2. Future reads to any part of chunks 1 or 2 will hit cache.
-
 3. **Split range into chunks**:
+
     ```rust
     fn split_range_into_chunks(
         &self,

From 339bdcb0ffe204556411493578c6f3dbd6389788 Mon Sep 17 00:00:00 2001
From: Li Yazhou
Date: Thu, 1 Jan 2026 22:10:08 +0800
Subject: [PATCH 5/9] improve the writings

---
 core/core/src/docs/rfcs/7127_foyer_chunked.md | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/core/core/src/docs/rfcs/7127_foyer_chunked.md b/core/core/src/docs/rfcs/7127_foyer_chunked.md
index 77523ba2c810..a4b1e16d5386 100644
--- a/core/core/src/docs/rfcs/7127_foyer_chunked.md
+++ b/core/core/src/docs/rfcs/7127_foyer_chunked.md
@@ -147,7 +147,13 @@
 
 ### Metadata Structure
 
-Beside the cache key, we also need to cache the metadata of the object. Since tbd
+In addition to caching chunks, the implementation must cache object metadata separately. The object size is critical for determining the total number of chunks and calculating the boundary of the last chunk, which may be smaller than the configured chunk size.
+
+Metadata lookup always precedes chunk operations. Before reading any chunks, the implementation must first retrieve the metadata to obtain the object size. This information is essential for correctly calculating chunk boundaries and ensuring cache consistency.
+
+If an metadata is not cached, we could regard the object as not cached and fallback to the underlying storage in most cases.
+
+TODO: take opendal's meta struct.
 
 ```rust
 #[derive(Serialize, Deserialize)]
 struct ObjectMetadata {
     size: u64,
     etag: Option<String>,
     last_modified: Option<DateTime<Utc>>,
@@ -166,7 +172,9 @@ The read operation follows this flow:
 
 2. **Prefetch with coerced range**:
 
-   Without prefetching, chunked cache would require separate requests for each chunk on cache miss, potentially increasing the total number of requests to the underlying storage. To mitigate this, when an object is not yet cached, the implementation coerces the requested range to chunk boundaries and fetches complete chunks in a single aligned request.
+   When an object is not yet cached, the implementation coerces the requested range to chunk boundaries, prefetches complete chunks in a single aligned request, and caches these chunks.
+
+   Without prefetching, chunked cache would require separate requests for each chunk on cache miss, potentially increasing the total number of requests to the underlying storage, which might increase the cost of API calls on object stores.
 
    For example, if a user requests bytes 100MB-150MB with 64MB chunks configured, the implementation fetches the aligned range 64MB-192MB in one request and saves chunks 1 and 2. This consolidates what would otherwise be two separate chunk requests into a single request. Future reads to any part of these cached chunks will hit the cache without additional requests.
 
@@ -232,6 +240,8 @@ The read operation follows this flow:
 
 3. **Split range into chunks**:
 
+   After obtaining the object size from metadata in the prefetch phase, the implementation splits the requested range into chunks. For each chunk, it calculates the exact range needed within that chunk, accounting for potential partial ranges at the beginning and end.
+
     ```rust
     fn split_range_into_chunks(
         &self,

From 91bbda83cb1dc699a63e0844867e6e71cf6f0fd6 Mon Sep 17 00:00:00 2001
From: Li Yazhou
Date: Thu, 1 Jan 2026 22:25:29 +0800
Subject: [PATCH 6/9] update the writing about metadata

---
 core/core/src/docs/rfcs/7127_foyer_chunked.md | 53 ++++++++-----------
 1 file changed, 23 insertions(+), 30 deletions(-)

diff --git a/core/core/src/docs/rfcs/7127_foyer_chunked.md b/core/core/src/docs/rfcs/7127_foyer_chunked.md
index a4b1e16d5386..35615ca30898 100644
--- a/core/core/src/docs/rfcs/7127_foyer_chunked.md
+++ b/core/core/src/docs/rfcs/7127_foyer_chunked.md
@@ -151,16 +151,23 @@
 
 Metadata lookup always precedes chunk operations. Before reading any chunks, the implementation must first retrieve the metadata to obtain the object size. This information is essential for correctly calculating chunk boundaries and ensuring cache consistency.
 
-If an metadata is not cached, we could regard the object as not cached and fallback to the underlying storage in most cases.
+If metadata is not cached, the object is treated as uncached and the implementation falls back to the underlying storage.
 
-TODO: take opendal's meta struct.
+Since OpenDAL's `Metadata` does not currently implement `Serialize/Deserialize`, we define a `CachedMetadata` wrapper that can be serialized to cache:
 
 ```rust
+/// Wrapper for OpenDAL's Metadata to make it serializable for cache.
 #[derive(Serialize, Deserialize)]
-struct ObjectMetadata {
-    size: u64,
-    etag: Option<String>,
-    last_modified: Option<DateTime<Utc>>,
+struct CachedMetadata {
+    meta: Metadata,
+}
+
+impl From<&Metadata> for CachedMetadata {
+    fn from(meta: &Metadata) -> Self {
+        Self {
+            meta: meta.clone(),
+        }
+    }
 }
 ```
 
@@ -182,20 +189,20 @@ The read operation follows this flow:
     async fn maybe_prefetch_range(
         &self,
         path: &str,
+        version: Option<String>,
         range: Option<Range<u64>>,
-    ) -> Result<ObjectMetadata> {
+    ) -> Result<CachedMetadata> {
         let chunk_size = self.chunk_size_bytes.unwrap();
 
-        // First, try to get cached metadata to obtain version
-        // We try with version=None first since we don't know the version yet
-        let meta_key_without_version = CacheKey::Metadata {
+        // First, try to get cached metadata
+        let meta_key = CacheKey::Metadata {
             path: path.to_string(),
             chunk_size,
-            version: None,
+            version: version.clone(),
         }.to_bytes();
 
-        if let Some(cached_meta) = self.cache.get(&meta_key_without_version).await {
-            return deserialize_metadata(cached_meta);
+        if let Some(cached_meta) = self.cache.get(&meta_key).await {
+            return Ok(cached_meta);
         }
 
         // Align the range to chunk boundaries for efficient prefetching
@@ -206,23 +213,9 @@ The read operation follows this flow:
         // Example: User requests 100-150MB, we fetch 64-192MB (chunks 1,2)
         let (rp, mut reader) = self.inner.read(path, aligned_range).await?;
 
-        // Extract version (etag) from response
-        let version = rp.metadata().etag().map(String::from);
-
-        // Save metadata with version
-        let metadata = ObjectMetadata {
-            size: rp.metadata().content_length(),
-            etag: version.clone(),
-            last_modified: rp.metadata().last_modified(),
-        };
-
-        let meta_key = CacheKey::Metadata {
-            path: path.to_string(),
-            chunk_size,
-            version: version.clone(),
-        }.to_bytes();
-
-        self.cache.insert(meta_key, serialize_metadata(&metadata)?).await;
+        // Convert OpenDAL's Metadata to CachedMetadata for cache storage
+        let metadata = CachedMetadata::from(rp.metadata());
+        self.cache.insert(meta_key, metadata).await;
 
         // Stream data and save chunks (with version)
         self.save_chunks_from_stream(path, reader, aligned_range.start, version).await?;

From ad86ccdc9f13ac70272a144c980739969637aad1 Mon Sep 17 00:00:00 2001
From: Li Yazhou
Date: Thu, 1 Jan 2026 22:27:43 +0800
Subject: [PATCH 7/9] tune the description about memory pressure

---
 core/core/src/docs/rfcs/7127_foyer_chunked.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/core/core/src/docs/rfcs/7127_foyer_chunked.md b/core/core/src/docs/rfcs/7127_foyer_chunked.md
index 35615ca30898..8a3fa1ce192a 100644
--- a/core/core/src/docs/rfcs/7127_foyer_chunked.md
+++ b/core/core/src/docs/rfcs/7127_foyer_chunked.md
@@ -13,7 +13,7 @@ In https://github.com/apache/opendal/pull/6366, we introduced the first version
 
 The current implementation caches the entire object as a single unit in the hybrid cache. While this approach works well for small objects and is straightforward to implement correctly, it faces limitations when dealing with large objects:
 
-- **Memory pressure**: Caching large objects (e.g., multi-GB files) as a whole can quickly exhaust the in-memory cache, causing frequent evictions and poor cache hit rates.
+- **Memory pressure**: Caching large objects (e.g., multi-GB files) requires serializing/deserializing the entire object to and from the cache in a single operation, which can consume significant memory.
 - **Bandwidth waste**: Reading a small portion of a large cached object requires loading the entire object from disk cache into memory, wasting I/O bandwidth.
 - **Cache efficiency**: Large objects have lower reuse probability compared to frequently accessed smaller chunks within them.
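The range-alignment and chunk-splitting arithmetic that the patches above describe can be sketched as two standalone functions. This is a minimal illustration only: the names mirror the RFC's `align_range` and `split_range_into_chunks`, but the free-function signatures below are assumptions for demonstration, not the actual FoyerLayer code.

```rust
use std::ops::Range;

/// Expand `range` outward to chunk boundaries, clamped to the object size.
/// E.g. 100MB-150MB with 64MB chunks becomes 64MB-192MB (chunks 1 and 2).
fn align_range(range: &Range<u64>, chunk_size: u64, object_size: u64) -> Range<u64> {
    let start = (range.start / chunk_size) * chunk_size;
    let end = range.end.div_ceil(chunk_size) * chunk_size;
    start..end.min(object_size)
}

/// Split `range` into (chunk_index, sub-range within that chunk) pairs,
/// handling partial first and last chunks.
fn split_range_into_chunks(range: &Range<u64>, chunk_size: u64) -> Vec<(u64, Range<u64>)> {
    let mut out = Vec::new();
    let mut offset = range.start;
    while offset < range.end {
        let index = offset / chunk_size;
        let chunk_start = index * chunk_size;
        // End of the requested data inside this chunk.
        let end = range.end.min(chunk_start + chunk_size);
        out.push((index, (offset - chunk_start)..(end - chunk_start)));
        offset = end;
    }
    out
}
```

On the RFC's running example (a 100MB-150MB request with 64MB chunks), `align_range` yields 64MB-192MB, and `split_range_into_chunks` yields chunk 1 with sub-range 36MB-64MB plus chunk 2 with sub-range 0-22MB; clamping to `object_size` covers the smaller last chunk.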
From 6be7669dc469193d64abf53e4c9d1b72f6b3131d Mon Sep 17 00:00:00 2001
From: Li Yazhou
Date: Thu, 1 Jan 2026 22:45:11 +0800
Subject: [PATCH 8/9] add note about assumption about immutable

---
 core/core/src/docs/rfcs/7127_foyer_chunked.md | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/core/core/src/docs/rfcs/7127_foyer_chunked.md b/core/core/src/docs/rfcs/7127_foyer_chunked.md
index 8a3fa1ce192a..ae4b664ae356 100644
--- a/core/core/src/docs/rfcs/7127_foyer_chunked.md
+++ b/core/core/src/docs/rfcs/7127_foyer_chunked.md
@@ -360,9 +360,14 @@ This deduplication happens transparently, requiring
 
 **Cache consistency**
 
-The cache follows an eventual consistency model aligned with OpenDAL's consistency guarantees. There is no distributed coordination for concurrent writes from different processes, and cache invalidation on write or delete operations is performed on a best-effort basis.
+**Important**: The cache is designed with the assumption that objects in the underlying storage are immutable. This means the cache expects that once an object is written, its content at a given path and version will not change.
 
-This relaxed consistency model is acceptable for typical object storage workloads, which are predominantly read-heavy and often involve immutable objects.
+When an update or delete operation occurs, the implementation attempts to invalidate the affected metadata and chunks from the cache. However, there are scenarios where stale cache entries may temporarily persist:
+
+- Concurrent writes from different processes or clients cannot be detected
+- Cache invalidation failures are silently ignored to maintain transparency
+
+This design is suitable for typical object storage workloads where objects are write-once-read-many. **Applications that frequently modify objects at the same path should carefully evaluate whether using this feature is appropriate.** Those requiring strong consistency guarantees should either disable caching or implement application-level cache invalidation mechanisms.
 
 ### Testing Strategy

From 69f9aae161512e63d288f28dd2f8d98f8cf35627 Mon Sep 17 00:00:00 2001
From: Li Yazhou
Date: Thu, 1 Jan 2026 22:49:39 +0800
Subject: [PATCH 9/9] tune the description about immutable, continue

---
 core/core/src/docs/rfcs/7127_foyer_chunked.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/core/core/src/docs/rfcs/7127_foyer_chunked.md b/core/core/src/docs/rfcs/7127_foyer_chunked.md
index ae4b664ae356..e50799f26dc2 100644
--- a/core/core/src/docs/rfcs/7127_foyer_chunked.md
+++ b/core/core/src/docs/rfcs/7127_foyer_chunked.md
@@ -362,7 +362,7 @@ This deduplication happens transparently, requiring
 
 **Important**: The cache is designed with the assumption that objects in the underlying storage are immutable. This means the cache expects that once an object is written, its content at a given path and version will not change.
 
-When an update or delete operation occurs, the implementation attempts to invalidate the affected metadata and chunks from the cache. However, there are scenarios where stale cache entries may temporarily persist:
+When an update or delete operation occurs, the implementation attempts to invalidate the affected metadata and chunks from the cache on a **best-effort basis**. This means invalidation is attempted but not guaranteed, and failures will not prevent the operation from succeeding. Stale cache entries may temporarily persist in scenarios such as:
 
 - Concurrent writes from different processes or clients cannot be detected
 - Cache invalidation failures are silently ignored to maintain transparency
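The patches above rely on whole-object, metadata, and chunk cache entries having distinct key formats, with the chunk size embedded in the key so that changing it naturally invalidates old entries. One possible encoding can be sketched as follows; the RFC names `CacheKey::Chunk` and `CacheKey::Metadata`, but the `Whole` variant, the exact field set, and the byte encoding here are illustrative assumptions, not the actual implementation.

```rust
/// Illustrative sketch of cache keys with distinct, non-colliding encodings.
/// The variant prefix separates whole-object and chunked entries, and
/// embedding `chunk_size` means a chunk-size change produces entirely new
/// keys, so stale entries are never looked up again.
#[derive(Debug)]
enum CacheKey {
    Whole { path: String },
    Metadata { path: String, chunk_size: u64, version: Option<String> },
    Chunk { path: String, chunk_size: u64, version: Option<String>, index: u64 },
}

impl CacheKey {
    fn to_bytes(&self) -> Vec<u8> {
        // A version-less entry is encoded with a "-" placeholder.
        let ver = |v: &Option<String>| v.as_deref().unwrap_or("-").to_string();
        match self {
            CacheKey::Whole { path } => format!("w:{path}"),
            CacheKey::Metadata { path, chunk_size, version } => {
                format!("m:{path}:{chunk_size}:{}", ver(version))
            }
            CacheKey::Chunk { path, chunk_size, version, index } => {
                format!("c:{path}:{chunk_size}:{}:{index}", ver(version))
            }
        }
        .into_bytes()
    }
}
```

With a shape like this, entries written under an older chunk size or mode simply stop being referenced and age out through LRU eviction, which matches the "no migration needed" behavior the patches describe.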