
Epic: ArrowReader performance improvements for split FileScanTasks #2172

@mbutrovich

Description

What feature are you trying to implement?

When running Iceberg workloads via DataFusion Comet, we're seeing significant per-file overhead in iceberg-rust's ArrowReader. Comet is an accelerator for Spark workloads that relies on Iceberg Java for scan planning. Iceberg Java generates FileScanTasks that are split by byte range and bin-packed across partitions for parallelism. Comet serializes these over to a native operator backed by iceberg-rust's ArrowReader.

iceberg-rust's ArrowReader treats each FileScanTask as a fully independent operation, with no reuse of I/O clients, metadata, or operators across tasks. With many thousands of tasks (common for large tables, or degenerate cases of partitioned tables with thousands of small data files), the per-task overhead becomes a significant fraction of total CPU time. Flame graphs show ~30% of executor CPU in per-file overhead: operator creation, stat() calls, TLS handshakes, and credential provider initialization.

I currently have a branch with all of these optimizations and have seen good benefits for our workloads (DataFusion Comet PR). The optimizations below are independent of one another; I will submit and benchmark each as a separate PR.

1. OpenDAL operator caching

Problem: create_operator() builds a new Operator on every call. Every storage operation (metadata(), reader(), etc.) calls create_operator(), so each FileScanTask creates at least 2 operators. Each operator construction involves credential provider initialization, signer creation, and wrapping in a RetryLayer. These are all discarded after a single use.

Fix: Cache operators by bucket/container name within OpenDalStorage (e.g., in a HashMap<String, Operator> behind a Mutex). The operator is stateless with respect to individual files — it's scoped to a bucket — so it's safe to reuse across concurrent tasks.

Scope: crates/iceberg/src/io/storage/opendal/mod.rs — the OpenDalStorage enum variants (S3, Gcs, Oss, Azdls) and create_operator().
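
A minimal sketch of the cache, assuming opendal's Operator remains cheap to clone (it is Clone over shared internals); OperatorCache and get_or_create are illustrative names, not existing iceberg-rust API:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

use opendal::Operator;

/// Hypothetical cache of OpenDAL operators, keyed by bucket/container name.
struct OperatorCache {
    inner: Mutex<HashMap<String, Operator>>,
}

impl OperatorCache {
    fn get_or_create(
        &self,
        bucket: &str,
        build: impl FnOnce() -> opendal::Result<Operator>,
    ) -> opendal::Result<Operator> {
        let mut cache = self.inner.lock().unwrap();
        if let Some(op) = cache.get(bucket) {
            // Cache hit: skip credential provider initialization, signer
            // creation, and RetryLayer wrapping entirely.
            return Ok(op.clone());
        }
        let op = build()?;
        cache.insert(bucket.to_string(), op.clone());
        Ok(op)
    }
}
```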

2. File size passthrough via FileScanTask

Problem: create_parquet_record_batch_stream_builder() calls try_join!(parquet_file.metadata(), parquet_file.reader()). The metadata() call issues a stat() / HEAD request to object storage just to get the file size. This file size is already known from the manifest entry's file_size_in_bytes field, but FileScanTask doesn't carry it.

Fix: Add file_size_in_bytes: u64 to FileScanTask (populated from the manifest entry during planning), and pass it through to skip the stat() call. When present, construct FileMetadata { size } directly instead of issuing a storage round-trip.

Scope: crates/iceberg/src/scan/task.rs (add field), crates/iceberg/src/arrow/reader.rs (use it), and scan planning code that constructs FileScanTasks.
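
A hedged sketch of the passthrough; FileScanTask's other fields are elided, and FileMetadata stands in for whatever metadata type the reader consumes:

```rust
/// Abbreviated FileScanTask showing only the proposed field.
struct FileScanTask {
    data_file_path: String,
    start: u64,
    length: u64,
    /// Proposed: copied from the manifest entry's `file_size_in_bytes`
    /// during scan planning.
    file_size_in_bytes: u64,
}

/// Stand-in for the storage layer's file metadata type.
struct FileMetadata {
    size: u64,
}

/// With the size carried on the task, the reader builds FileMetadata
/// directly instead of issuing a stat()/HEAD round-trip per task.
fn file_metadata_from_task(task: &FileScanTask) -> FileMetadata {
    FileMetadata {
        size: task.file_size_in_bytes,
    }
}
```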

3. Eliminate double-open for migrated tables

Problem: process_file_scan_task() calls create_parquet_record_batch_stream_builder() once to inspect the Parquet schema for embedded field IDs. If the file lacks field IDs (missing_field_ids is true), it calls create_parquet_record_batch_stream_builder() a second time with ArrowReaderOptions to apply name mapping or fallback IDs. Each call issues metadata() + reader() to storage. For migrated tables, this means 4 operator creations and 2 full metadata fetches per task.

Fix: Separate metadata loading from stream builder construction. Load ParquetMetaData once, inspect it in-memory for field IDs using ArrowReaderMetadata::try_new(), then construct a single ParquetRecordBatchStreamBuilder via new_with_metadata() with the correct options.

Scope: crates/iceberg/src/arrow/reader.rs — restructure process_file_scan_task() and create_parquet_record_batch_stream_builder().
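
A sketch of the restructured flow using the arrow-rs APIs named above; the in-memory field-ID probe and the options_with_fallback_ids() helper are illustrative, not the actual iceberg-rust logic:

```rust
use std::sync::Arc;

use parquet::arrow::arrow_reader::{ArrowReaderMetadata, ArrowReaderOptions};
use parquet::arrow::async_reader::{AsyncFileReader, ParquetRecordBatchStreamBuilder};
use parquet::arrow::PARQUET_FIELD_ID_META_KEY;
use parquet::errors::Result;
use parquet::file::metadata::ParquetMetaData;

/// Hypothetical helper building options that apply name mapping /
/// fallback field IDs (e.g. via a schema override); stubbed here.
fn options_with_fallback_ids() -> ArrowReaderOptions {
    ArrowReaderOptions::new()
}

fn build_stream<T: AsyncFileReader + Send + 'static>(
    input: T,
    metadata: Arc<ParquetMetaData>,
) -> Result<ParquetRecordBatchStreamBuilder<T>> {
    // Interpret the already-loaded footer once; no storage I/O here.
    let default_meta =
        ArrowReaderMetadata::try_new(metadata.clone(), ArrowReaderOptions::new())?;

    // Probe the decoded Arrow schema in memory for embedded field IDs.
    let missing_field_ids = default_meta
        .schema()
        .fields()
        .iter()
        .all(|f| !f.metadata().contains_key(PARQUET_FIELD_ID_META_KEY));

    // Re-interpret the same footer with different options if needed;
    // still no second metadata() + reader() pair against storage.
    let reader_metadata = if missing_field_ids {
        ArrowReaderMetadata::try_new(metadata, options_with_fallback_ids())?
    } else {
        default_meta
    };
    Ok(ParquetRecordBatchStreamBuilder::new_with_metadata(
        input,
        reader_metadata,
    ))
}
```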

4. Parquet metadata prefetch with size hint

Problem: Without a metadata size hint, arrow-rs's Parquet reader does two round-trips to read the footer: first an 8-byte read to get the magic bytes and footer length, then a second read for the actual footer. On object storage, each round-trip has significant latency.

Fix: Use ArrowFileReader::with_metadata_size_hint() (which wraps arrow-rs's ParquetMetaDataReader::with_prefetch_hint()) to speculatively read a larger chunk from the end of the file in a single request. DataFusion uses 512KB as the default. The exact value can be discussed in the PR.

Scope: crates/iceberg/src/arrow/reader.rs — in create_parquet_record_batch_stream_builder() when constructing ArrowFileReader.
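
A sketch of the underlying arrow-rs call; 512 * 1024 mirrors DataFusion's default and is a placeholder, not a decided value:

```rust
use parquet::file::metadata::ParquetMetaDataReader;

/// Placeholder mirroring DataFusion's 512KB default; the actual value
/// is up for discussion in the PR.
const METADATA_PREFETCH_HINT: usize = 512 * 1024;

fn metadata_reader_with_hint() -> ParquetMetaDataReader {
    // With a hint, the reader speculatively fetches one tail chunk instead
    // of an 8-byte footer-length probe followed by a second footer read.
    ParquetMetaDataReader::new().with_prefetch_hint(Some(METADATA_PREFETCH_HINT))
}
```

If the hint undershoots the actual footer size, the reader fetches the remainder in a follow-up request, so a too-small value costs an extra round-trip rather than correctness.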

5. Parquet metadata caching across same-file splits

Problem: When Iceberg Java splits a large file into multiple FileScanTasks (different start/length byte ranges of the same file), iceberg-rust re-fetches the Parquet metadata for each split independently. The metadata is identical across splits of the same file.

Fix: Cache ParquetMetaData by (file_path, should_load_page_index) within a read() invocation using a bounded LRU cache (e.g., moka). Subsequent tasks for the same file get a cache hit and skip all metadata I/O. Use ParquetRecordBatchStreamBuilder::new_with_metadata() to pass the cached metadata to the builder.

Scope: crates/iceberg/src/arrow/reader.rs — new ParquetMetadataCache struct shared across tasks via ArrowReader::read().
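
A minimal sketch of the proposed struct, assuming moka's sync Cache as the bounded LRU; the capacity and key shape are illustrative:

```rust
use std::sync::Arc;

use moka::sync::Cache;
use parquet::file::metadata::ParquetMetaData;

/// Proposed cache, keyed by (file_path, should_load_page_index).
/// Hits clone an Arc, so splits of the same file share one footer.
struct ParquetMetadataCache {
    inner: Cache<(String, bool), Arc<ParquetMetaData>>,
}

impl ParquetMetadataCache {
    fn new(max_entries: u64) -> Self {
        Self {
            inner: Cache::new(max_entries),
        }
    }

    fn get(&self, path: &str, page_index: bool) -> Option<Arc<ParquetMetaData>> {
        self.inner.get(&(path.to_string(), page_index))
    }

    fn insert(&self, path: String, page_index: bool, meta: Arc<ParquetMetaData>) {
        self.inner.insert((path, page_index), meta);
    }
}
```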

Note: The benefit of this optimization depends on how the external planner groups tasks. In Spark's default configuration, bin-packing rarely places multiple splits of the same file in the same partition, so cache hits may be infrequent. However, when iceberg-rust plans its own tasks (whole-file), or when external planners explicitly coalesce same-file splits, this provides a measurable reduction in metadata I/O. We tried this already and it didn't make a huge difference on our workload with many small files (#2100).

6. Range coalescing in reads

Problem: We're seeing poor TPC-H SF1000 performance on large table scans. I suspect it's because we don't coalesce nearby byte ranges, so each column-chunk read becomes its own request to object storage.

Fix: Adopt object_store's range coalescing to reduce S3 requests.

Scope: crates/iceberg/src/arrow/reader.rs — new methods to coalesce byte ranges with reasonable defaults.
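
An illustrative gap-based merge (object_store ships a similar utility; the 1 MiB threshold is an assumption, not a tuned default):

```rust
use std::ops::Range;

/// Assumed threshold: merge ranges whose gap is small enough that one
/// larger request beats two round-trips.
const MAX_GAP: u64 = 1024 * 1024;

fn coalesce(mut ranges: Vec<Range<u64>>) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut out: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        match out.last_mut() {
            // Extend the previous range when the gap is small, trading a
            // little over-read for one fewer object-storage request.
            Some(prev) if r.start <= prev.end + MAX_GAP => {
                prev.end = prev.end.max(r.end);
            }
            _ => out.push(r),
        }
    }
    out
}
```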

Commentary

2, 3, and 4 seem the least controversial to me, so I will start with those. For 1, there was some discussion in Iceberg Slack about implementing it after the Storage trait work is done (@CTTY). For 5, I added a note above that it might not be very effective and maybe not worth the complication; I will measure it independently and see how it does.

Willingness to contribute

I can contribute to this feature independently
