perf: use foldhash for log-replay file dedup set#2698

Open
DrakeLin wants to merge 1 commit into
delta-io:mainfrom
DrakeLin:stack/swap-dedup-hasher

Conversation

@DrakeLin DrakeLin commented Jun 5, 2026

Collaborator

What changes are proposed in this pull request?

Swap the std HashSet used to track seen (path, dv_unique_id) pairs during log replay for a HashSet backed by foldhash::fast::RandomState. This is the hottest hash structure on the read path: every Add/Remove across every commit and every action in every checkpoint hits a contains and (for commit batches) an insert. SipHash-1-3 (the std default) is ~2x slower than foldhash on small keys for no functional benefit here.
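The probe pattern described above can be sketched roughly as follows. This is a minimal illustration, not the kernel's actual code: the field layout of `FileActionKey`, the function name `check_and_record_seen`, and the `is_commit_batch` flag are assumptions made for the example. The function is generic over the `BuildHasher`, which is why swapping the hasher does not change the dedup logic itself.

```rust
use std::collections::HashSet;
use std::hash::BuildHasher;

/// Assumed shape of the dedup key: a (path, dv_unique_id) pair.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct FileActionKey {
    pub path: String,
    pub dv_unique_id: Option<String>,
}

/// Hypothetical helper: returns true if `key` was already seen.
/// Every action probes the set with `contains`; only commit batches
/// go on to record the key with `insert`.
pub fn check_and_record_seen<S: BuildHasher>(
    seen: &mut HashSet<FileActionKey, S>,
    key: FileActionKey,
    is_commit_batch: bool,
) -> bool {
    if seen.contains(&key) {
        return true;
    }
    if is_commit_batch {
        seen.insert(key);
    }
    false
}
```

Because both calls hash the key, the cost of the hash function is paid at least once per file action, which is what makes the choice of hasher matter on this path.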

The alternate hasher is hidden behind a crate-private SeenFileKeys type alias. SerializableScanState.seen_file_keys keeps its existing HashSet<FileActionKey> (std hasher) so the distributed-scan wire format and public API are unchanged; conversion happens once per phase boundary.
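A rough sketch of the alias-plus-conversion arrangement, under stated assumptions: `StandInHasher` is a std-only FNV-1a placeholder so the example compiles without extra crates (it is not foldhash and is unseeded; the PR uses `foldhash::fast::RandomState` in that position), and `WireFileKeys`, `from_wire`, `to_wire`, and the `FileActionKey` fields are hypothetical names for illustration.

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Placeholder hasher (FNV-1a, 64-bit). NOT foldhash, and unseeded;
/// it only stands in for the faster hasher in this sketch.
pub struct StandInHasher(u64);

impl Default for StandInHasher {
    fn default() -> Self {
        StandInHasher(0xcbf29ce484222325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for StandInHasher {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x100000001b3); // FNV-1a 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Assumed shape of the dedup key: a (path, dv_unique_id) pair.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct FileActionKey {
    pub path: String,
    pub dv_unique_id: Option<String>,
}

/// Crate-private alias: internal code names the alias, never the hasher.
pub(crate) type SeenFileKeys = HashSet<FileActionKey, BuildHasherDefault<StandInHasher>>;

/// Std-hasher set, standing in for the serialized field's type.
pub type WireFileKeys = HashSet<FileActionKey>;

/// Rebuild the fast set from the wire-format set (rehashes every key once).
pub(crate) fn from_wire(keys: WireFileKeys) -> SeenFileKeys {
    keys.into_iter().collect()
}

/// Convert back to the std-hasher set before serialization.
pub(crate) fn to_wire(keys: SeenFileKeys) -> WireFileKeys {
    keys.into_iter().collect()
}
```

Each conversion is a one-time O(n) rehash, so it only pays off when the set is probed many times between conversions, as it is during log replay.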

foldhash::fast::RandomState reseeds per-process, so HashDoS resistance for path keys read from the Delta log is preserved.

How was this change tested?

  • cargo nextest run -p delta_kernel --all-features (4704 tests pass)
  • cargo clippy --workspace --benches --tests --all-features -- -D warnings
  • cargo doc --workspace --all-features --no-deps

@DrakeLin DrakeLin changed the title from "test: checkpoint should not contain tombstoned domain metadata (#2365)" to "perf: use foldhash for log-replay file dedup set" Jun 5, 2026
codecov Bot commented Jun 5, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.85%. Comparing base (0003a4e) to head (00206bd).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2698      +/-   ##
==========================================
- Coverage   88.85%   88.85%   -0.01%     
==========================================
  Files         199      199              
  Lines       64673    64820     +147     
  Branches    64673    64820     +147     
==========================================
+ Hits        57465    57595     +130     
- Misses       5022     5027       +5     
- Partials     2186     2198      +12     

github-actions Bot commented Jun 5, 2026


Benchmark results

Commit: 00206bd · Trigger: auto-push · Tags: base

| Test | Base | PR | Change |
|---|---|---|---|
| 101kAdds1kCommitsSinceChkpt1Chkpt/readLatest/readMetadata/serial | 328.2±4.15ms | 325.2±5.78ms | 1.01x faster |
| 101kAdds1kCommitsSinceChkpt1Chkpt/readV10/readMetadata/serial | 1572.8±44.24µs | 1601.7±39.77µs | 1.02x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/readV110/readMetadata/serial | 33.7±0.49ms | 33.3±0.79ms | 1.01x faster |
| 101kAdds1kCommitsSinceChkpt1Chkpt/readV210/readMetadata/serial | 65.8±1.12ms | 67.5±4.90ms | 1.03x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/readV510/readMetadata/serial | 163.6±3.11ms | 159.1±2.91ms | 1.03x faster |
| 101kAdds1kCommitsSinceChkpt1Chkpt/readV60/readMetadata/serial | 17.7±0.25ms | 17.3±0.26ms | 1.02x faster |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotLatest/snapshotConstruction | 113.7±2.39ms | 114.2±3.08ms | 1.00x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotV10/snapshotConstruction | 29.4±0.20ms | 29.4±0.65ms | 1.00x |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotV110/snapshotConstruction | 38.4±0.69ms | 37.9±0.42ms | 1.01x faster |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotV210/snapshotConstruction | 47.3±0.75ms | 47.1±0.98ms | 1.00x faster |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotV510/snapshotConstruction | 73.5±1.97ms | 74.0±1.71ms | 1.01x slower |
| 101kAdds1kCommitsSinceChkpt1Chkpt/snapshotV60/snapshotConstruction | 33.9±0.36ms | 33.6±0.33ms | 1.01x faster |
