refactor(flat): replace persisted-snapshot HSST with single-level sorted table by asdacap · Pull Request #12100 · NethermindEth/nethermind

asdacap · 2026-06-23T07:19:57Z

Part of splitting the large flat/long-finality work into reviewable pieces. This PR isolates the persisted-snapshot on-disk format: it removes the nested columnar HSST and replaces it with a single, deliberately-unoptimized, binary-search sorted table. Trie-node RLP still lives in blob arenas as NodeRefs; only the metadata format changes.

Changes

New format (PersistedSnapshots/Sorted/): SortedTable / SortedTableBuilder / SortedTableReader / SortedTableEnumerator + FORMAT.md. Interleaved [ks u16][key][vs u16][value] records, a u32 offset index, and a footer. The builder buffers all keys and sorts the offsets at Build; the reader binary searches (O(log N), no caching).
Verbose keys (PersistedSnapshotKey): fully-materialized <column><addr><subcolumn><slot> keys with the column/subcolumn tag bytes stored as 255 − tag, so a plain ascending byte sort reproduces the HSST reverse-tag emission order. The builder/compacter ordering and comparers are kept unchanged so a future proper-HSST serializer can reuse them.
Rewired PersistedSnapshotBuilder / PersistedSnapshotReader / PersistedSnapshotMerger / PersistedSnapshotScanner / PersistedSnapshot over the sorted table. Dropped AddressBoundCache and the address-BTree-root caching. The merger preserves newest-wins plus per-address self-destruct truncation.
Removed all HSST format code (Hsst/{BTree,PackedArray,DenseByteIndex,TwoByteSlot}, the dispatchers, the N-way merge plumbing). Kept the seam interfaces (IHsstByteReader/Bound, IByteBufferWriter, IHsstReaderSource, SpanByteReader, PooledByteBufferWriter), the Storage/ arena layer, bloom, and orchestration.
Catalog version bumped 1 → 2: existing long-finality persisted-snapshot directories are unreadable by the new reader and fail fast with a "wipe and resync" message.
Deleted the HSST format unit tests; added SortedTableTests.

Types of changes

What types of changes does your code introduce?

Breaking change (a change that causes existing functionality not to work as expected)
Refactoring

Breaking only for the experimental EnableLongFinality on-disk format (off by default): the persisted_snapshot directory must be wiped.

Testing

Requires testing

Yes

If yes, did you write tests?

Yes

Notes on testing

Full Nethermind.slnx builds. Nethermind.State.Flat.Test is green — 729 existing tests (4 pre-existing skips) plus 4 new SortedTableTests — including the compactor tests that cover newest-wins and per-address self-destruct truncation.

Documentation

Requires documentation update

No

Format documented in-repo at PersistedSnapshots/Sorted/FORMAT.md.

Requires explanation in Release Notes

No

Experimental feature, off by default.

Remarks

Intermediate step: the format is intentionally unoptimized (a full record per slot, buffer-all-keys at build). Proper HSST returns in a later PR, reusing the unchanged builder/compacter ordering.

🤖 Generated with Claude Code

…NWayMergeCursor The per-slot HsstEnumerator was previously a field on each IHsstMergeSource implementation, retrieved via GetEnumerator(). That smeared two concerns — "what bytes to merge" (view) and "where in them" (enumerator + bound) — onto the source struct, and forced every source-construction path (FromView / MapCursorSource via WithEnumerator / IHsstEnumeratorFactory) to handle the HSST layout dispatch (tail-byte vs front-byte two-byte-slot). Reshape: - IHsstMergeSource drops GetEnumerator() and IDisposable, gains a `Bound` property. The source is now a true minimal "what to merge" pair — reader factory + bound. - WholeReadSessionMergeSource stores (view, bound); `WithEnumerator` is replaced by `WithBound`. FromView resolves the column-tag bound and returns a (view, bound) pair, no enumerator construction. - IHsstEnumeratorFactory moves out of PersistedSnapshotMerger to a new file in the Hsst namespace, generic over <TReader, TPin>. - NWayMergeCursor gains a TFactory generic parameter and constructs one HsstEnumerator per slot in its ctor, storing them in a caller-supplied `Span<HsstEnumerator<TReader, TPin>>`. `MinValue` and `ValueAt` read the enumerator directly instead of going through the source. - MapCursorSource collapses to a "rewrite the bound" loop — no factory parameter, no per-source enumerator construction. Every cursor consumer (HsstBTreeMerger.NWayMerge / NWayMergeKeyFirst, HsstPackedArrayMerger.NWayMerge, HsstTwoByteSlotMerger.NWayMerge, IHsstBTreeValueMerger.OnFastCopy / MergeValues) gains a TFactory generic parameter for naming the cursor type; the underlying behaviour is unchanged. Concrete value-merger structs (PerAddressColumnValueMerger, StorageTrieColumnValueMerger, SlotPrefixValueMerger) commit to TFactory = TailDispatchEnumeratorFactory in their interface impls. Every cursor construction site now rents a parallel ArrayPoolList<HsstEnumerator<…>> alongside its sources buffer; the per- source Dispose loops are removed (sources no longer carry disposable enumerators, and HsstEnumerator's Dispose is a no-op anyway). SlotPrefixValueMergerScratch gains an InnerEnumerators ArrayPoolList for the inner 2-byte slot-suffix merge's enumerator buffer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…Bound FromView now wraps a view in a source over the entirety of its bytes — a single-line static factory with no HSST-layer logic. The column-tag scope resolution moves to a dedicated ResolveColumnBound(view, columnTag) helper in PersistedSnapshotMerger that returns a Bound. The five seeding loops in NWayMergeSnapshotsWithViews now use the source's primary constructor with the pre-resolved bound: columnSources[i] = new(views[i], ResolveColumnBound(views[i], TAG)); Same behaviour, clearer separation: FromView is just "view → source", and column-tag seeking is a single named operation rather than a hidden step inside the source factory. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…uildMergeCursor After the recent FromView simplification (no more columnTag parameter), every actual caller migrated to the primary constructor + ResolveColumnBound, leaving FromView with zero use sites — just a stale doc-cref. Delete it. MapCursorSource had a single caller (BuildMergeCursor's body) and only existed because the previous design separated source-mapping from cursor construction. Fold its three-line for-loop directly into BuildMergeCursor. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Eight NativeMemoryListRef rentals in PersistedSnapshotMerger held bounded Span<int>/Span<Bound> scratch (sized by matchCount or slotCapacity, both ≤ snapshot count N ≤ ~32). Converted to stackalloc: - PerAddressColumnValueMerger.MergeValues: perAddrBounds + subTagBounds - PerAddressColumnValueMerger.MergeSlots: slotSources + slotBounds - StorageTrieColumnValueMerger.MergeValues: perAddrBounds + subTagBounds - StorageTrieColumnValueMerger.MergeStorageSubTag: srcs + subBounds Bound is `record struct(long Offset, long Length)` — unmanaged, so stackalloc Bound[n] compiles. Per-call stack budget is well under 1 KiB total across all four stackallocs in any single call frame. The two NativeMemoryListRef rentals in NWayMetadataMerge stay — their sizes (totalRefIdsBytes) are unbounded and could plausibly exceed 2 GiB, which the file comment already calls out. Adding `scoped` to the relevant Span<Bound>/ReadOnlySpan<Bound> parameter on ResolvePerAddrAndSubTagBounds and the per-sub-tag method signatures (MergeSlots/MergeSelfDestruct/MergeAccount/MergeStorageSubTag) is required so the compiler accepts passing stackalloc spans alongside the scoped ref HsstDenseByteIndexBuilder parameter. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…opy/MergeValues Only one of three value-merger implementations had a non-empty OnKey body (PerAddressColumnValueMerger: bloom.Add(addrKey)); the other two were literally `{ }`. Removing OnKey from the interface and the driver loop shrinks both ends — the per-address merger gets a single `bloom.Add(addrKey)` at the start of OnFastCopy and MergeValues; the other two impls lose their no-op bodies; HsstBTreeMerger's NWayMerge / NWayMergeKeyFirst lose the trailing `valueMerger.OnKey(cursor.MinKey)` dispatch on every emitted key. IHsstPackedArrayMergeCallback.OnKey and IHsstTwoByteSlotMergeCallback.OnKey stay — they're the only callbacks those simpler mergers expose, and their implementations all do non-trivial per-key work. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…reeNodeLayoutPlanner Both parameters of PlanFromProfile were write-only: callers passed values that were never read by any code path, only re-assigned by the slot-widening block. Removing them lets Plan skip the secondLen/allSameLenExceptFirst bookkeeping in its lengths-scan loop. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…BTreeBuilder.BuildIndex BuildIndex no longer runs a leaf phase: EmitInlineLeaf and WrapLoneEntryAsLeaf emit any leaves before BuildIndex runs and call EnsureSize themselves. The minLeafEntries parameter was unread; maxLeafEntries only inflated the ValueScratch pre-size for nodes BuildIndex doesn't write. Also clear up two stale comments referencing the removed OpenReader / LeafBoundaryEnumerator / PrecomputeCommonPrefixLengths code paths. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…and stale members Removed: - IByteBufferWriter.Copy<TWriter,TReader,TPin>(...) 3-param reader-bulk-copy overload — every caller uses the 2-param Span overload. - IHsstByteReader<TPin>.Bound { get; } and all 8 implementations (SpanByteReader, PooledByteBufferWriter.Reader, ArenaBufferReader, ArenaByteReader, WholeReadSessionReader, MmapByteReader test impl, HsstReaderTests CopyOnlyByteReader, HsstDenseByteIndexTests trailer/spec-stage readers). Property had no readers. - KeyValueEntry.KeyLength property and constructor parameter — every consumer uses only ValueBound; the current key length is available via HsstEnumerator.CurrentKeyLength on callers that need it. - HsstPackedArrayBuilder.WriteLeb128 — orphan private method. - HsstTwoByteSlotValueReader.TryResolveAll and HsstTwoByteSlotValueLargeReader.TryResolveAll — only HsstDenseByteIndexReader.TryResolveAll is invoked across the codebase. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Every address-bound cache miss in TryGetAddressBound pays two reads in HsstBTreeReader.TrySeek before stepping onto the root node: the 5-byte trailer (RootPrefixLen / RootSize / KeyLength) and the variable-length root prefix. The address column is immutable for the life of a snapshot, so those values are too. Resolve the address column bound and read the trailer + root prefix once at construction, then route the miss path through a new TrySeekFromRoot entry point that takes the precomputed root descriptor and starts the walk directly. Length == 0 on the cached bound is the sentinel for "snapshot has no address column" and short-circuits the lookup to "no entry". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

PersistedSnapshotUtils.cs: remove `// 1. Accounts` / `// Deserialize X` labels that just restated the loop variable. BloomFilterTests.cs: drop AAA banners from a 4-line test. PersistenceManagerTests.cs: trim `Bug A/B regression` prefixes and `OLD ... was removed` / `previously skipped` framing — keep the current invariants each test asserts. HsstBTreeBuilder{,Buffers}.cs: drop `(formerly HsstIndexBuilder)` and `Previously stackalloc'd / Promoted to` framing; describe the current pooled design directly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…nch path Bug: ArenaReservation.PersistOnShutdown() was a 1-line forwarder to _arenaFile.PersistOnShutdown() — it set no per-reservation flag. So when ArenaReservation.CleanUp ran on a snapshot inside Repo.Dispose, the punch-hole reclaim path was gated only on MarkDead's structural return value (sibling reservations still live in this file?). In a shared arena where snap1 is disposed while snap2 still holds a reservation, MarkDead returns true and TryPunchHole zeros snap1's range in the preserve-flagged-but-still-live file. The file-level _preserveOnDispose flag only suppressed File.Delete in ArenaFile.CleanUp; it had no effect on the punch. Fix: add _preserveOnDispose to ArenaReservation (same set-once Interlocked.Exchange pattern as ArenaFile), set it in PersistOnShutdown before forwarding to the file, gate TryPunchHole on it in CleanUp. Reservation-level intent is now symmetric with file-level intent: both layers honor the same opt-in. Pruning path unaffected — pruned snapshots never call PersistOnShutdown, so the flag stays 0 and TryPunchHole runs as before, reclaiming the disk blocks of pruned reservations. Test: parameterize Repository_Restart_PreservesAllData with two maxArenaSize values. 4 KiB (each snapshot's metadata reservation page-rounds to fill its arena file → MarkDead fully-dies on the sole reservation → no punch path → masked the bug pre-fix). 1 MiB (both snapshots' reservations pack into one arena → MarkDead returns true on the first dispose → reproduces the bug pre-fix). Before this fix the 1 MiB variant failed on Assert.That(snap2.TryLoadStateNodeRlp(path2, ...), Is.True) because snap2's metadata HSST was zeroed. After this fix both variants pass. Full Nethermind.State.Flat.Test → 870/870 + 7 pre-existing skips. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ource persist Sibling to the existing AddToPersistence_InMemoryPersist_PrunesPersistedTier for the persistedToPersist branch (line 426-432 of PersistenceManager). A tier-source persist must also drive PruneBefore so the in-memory tier doesn't keep growing with entries that RocksDB now supersedes. Setup mirrors the existing DetermineSnapshotAction_FinalizedNoInMemory_ FallsBackToPersistedSnapshot test: empty arena reservation wrapped in a PersistedSnapshot, stubbed TryLeaseSnapshotTo + LeaseBaseSnapshotsInRange to feed DetermineSnapshotAction's tier fallback path. Test passes — confirms the persistedToPersist branch correctly invokes _repo.PruneBefore(persistedToPersist.To) after PersistPersistedSnapshot. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Centralises every boundary-math site in the persistence pipeline behind ICompactionSchedule so the per-instance offset (used to stagger multi-node deployments) reaches every layer that decides when to compact. Pre-fix: only SnapshotCompactor honoured the offset. PersistenceManager accepted ICompactionSchedule as a ctor dep but never called a method on it. PersistedSnapshotCompactor didn't even take the dep. Both re-implemented the alignment math inline (`b & -b`, `b % CompactSize`), defeating the schedule's anti-lockstep design intent at the persisted tier. Interface gains three intent-revealing methods (no parameter overload — one method per use case): bool IsFullCompactionBoundary(long blockNumber); long GetHierarchicalCompactSize(long blockNumber); bool IsHierarchicalBoundary(long blockNumber); All three reuse a private ShiftedAlignment helper inside CompactionSchedule that applies the offset once. Unlike GetCompactSize / NextFullCompactionAfter, the three new methods do NOT short-circuit on `_compactSize <= 1` — PersistedSnapshotCompactor runs with its own min/max caps independent of config.CompactSize and the tests construct that edge case. Migrated call sites: PersistedSnapshotCompactor.cs:57 → GetHierarchicalCompactSize PersistedSnapshotCompactor.cs:72 → IsFullCompactionBoundary PersistenceManager.cs:120 → IsFullCompactionBoundary PersistenceManager.cs:129 → GetHierarchicalCompactSize PersistenceManager.cs:150 → IsHierarchicalBoundary PersistenceManager.cs:237/244 → NextFullCompactionAfter (existing method) PersistedSnapshotCompactor's ctor gains an ICompactionSchedule param (wired in FlatWorldStateModule + 4 test fixtures). PersistenceManager's dead-code xmldoc on _schedule is removed. _compactSize field stays for span-width checks (lines 238/321/379/601 measure snapshot extents, not boundary decisions, and don't need the offset). Tests: 3 parameterized tests added to CompactionScheduleTests covering the new methods at offset 0 and 3. Full Nethermind.State.Flat.Test -> 892/892 pass + 7 pre-existing skips. Sanity grep: after this commit, no production file under Nethermind.State.Flat/ contains the inline `b & -b` or `% _compactSize` expressions except a comment in CompactionSchedule.cs itself. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…anedFrontier Sibling to the ArenaReservation punch-on-preserve bug (8ace70c). PersistedSnapshot.CleanUp drops its blob lease then calls TryResetOrphanedFrontier when the file has only the manager's lease left — which is the case for every blob the last surviving snapshot referenced during repo dispose. TryResetOrphanedFrontier punched the WHOLE [0, prev) range without consulting the file's _preserveOnDispose flag, so blobs the next session needs to rehydrate would survive on disk (BlobArenaFile.CleanUp honours the flag for File.Delete) but their contents would all read as zeros. Symptom in a real workload: 32 blob frontier-resets logged as punched=True followed by 32 BlobArenaFile.CleanUp with preserved=True — all 32 files kept on disk, all 32 zeroed. Fix: add BlobArenaFile.IsShutdownPreserved accessor (mirrors the existing ArenaFile shape), gate TryResetOrphanedFrontier on it. Pruning path unaffected — pruned snapshots never call PersistOnShutdown, so the flag stays 0 and the orphan-frontier reclaim still punches the blob to free disk blocks. Regression test: extend Repository_Restart_PreservesAllData to write 10 valid-RLP-framed state nodes per snapshot (~500 bytes each → cumulative blob frontier > 4 KiB so the punch over [0, frontier) actually zeros the first OS page; sub-page punches are no-ops on tmpfs and would have let the test silently pass with the bug present). Each node round-trips through TryLoadStateNodeRlp post-reload. Pre-fix the test fails with "Expected is Byte[503], actual is Byte[1]" — the punched RLP prefix reads as zero, the reader interprets it as RLP empty-string and returns a single-byte array. Full Nethermind.State.Flat.Test -> 892/892 + 7 pre-existing skips. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Replaces pre-extend-then-punch-hole with start-at-0-then-truncate so the blob file's on-disk length always tracks the actual data extent. Before: BlobArenaFile ctor pre-extended every new file to MaxSize via sparse ftruncate, and TryResetOrphanedFrontier reclaimed via fallocate(PUNCH_HOLE) + posix_fadvise(DONTNEED). Two side effects: 1. FileInfo.Length == MaxSize always, diverging from Frontier from creation onward. 2. Initialize's `frontier = FileInfo.Length` restore therefore reported MaxSize for any file that had been written to. The headroom check `len < _maxFileSize` was always false, so restored files never re-entered _mutableFiles — packing-reuse was strictly in-process. The recently-fixed punch-on-preserve bug (8a78319) was made worse because `prev = ReportedFrontier = MaxSize` meant the buggy punch wiped the entire file's actual data, not just the originally-written prefix. Now: BlobArenaFile ctor skips the pre-extension (FileStream.Write auto-extends, and BlobArenaWriter's 1 MiB internal buffer amortises growth syscalls anyway). TryResetOrphanedFrontier calls SetFileLength(0) — one syscall that zeros the logical length AND frees all disk blocks AND implicitly invalidates the page cache for the range. No fadvise fallback needed. The IsShutdownPreserved guard at the top of TryResetOrphanedFrontier stays — truncating a preserve-flagged file is strictly worse than punching it (file length 0 → NodeRef offsets point past EOF → reads fail outright). Side benefit: restored files now correctly report headroom via `len < _maxFileSize`, so packing-reuse survives restarts. Dead code removed: BlobArenaFile.PunchHole, BlobArenaManager's _punchHoleOnReclaim field + _punchHoleSupported field + PunchHoleSupported property + punchHoleOnReclaim ctor param + PersistedSnapshotPunchHoleEnabledByTier[_tier] write. The shared config flag PersistedSnapshotPunchHoleOnReclaim stays — still consumed by ArenaManager (arena side untouched). Tests: - ArenaReclaimPunchHoleTests.BlobFrontierReset_PunchesHole_* rewritten as BlobFrontierReset_TruncatesFile_ForOrphanedRange — single non-parameterized test that asserts FileInfo.Length == 0 after TryResetOrphanedFrontier. - Repository_Restart_PreservesAllData gains an inter-session blob-length check: every preserved blob_*.bin must have 0 < length <= MaxSize after Repo.Dispose. Catches a regression to truncating-on-preserve (length 0) OR pre-extending (length > MaxSize). - BlobArenaFile.FadviseDontNeed stays (still used by PersistedSnapshot.AdviseDontNeedBlobRange's read-side cache hint). Full Nethermind.State.Flat.Test -> 891/891 + 7 pre-existing skips. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…nvert/compact A persisted snapshot's bytes lived only in the OS page cache between the writer's Complete and the next persistence-layer checkpoint. A crash in that window left the catalog (committed via the RocksDB-backed SnapshotCatalog) pointing at pages that were never durably on disk — session 2 would Initialize and TryLeaseSnapshotTo successfully but reads would return whatever the kernel had not yet flushed. Fix: fsync the metadata arena (always) and the blob arena (base snapshots only) between writer.Complete and _catalog.Add. Catalog never records an entry that references unsynced pages. Convert path (PersistedSnapshotRepository.ConvertSnapshotToPersistedSnapshot): both reservation.Fsync() + blobWriter.Fsync(). Base snapshots write trie-node RLPs to a blob arena, so both files need the barrier. Compact path (PersistedSnapshotCompactor.CompactRange): reservation.Fsync() only. Compaction emits NodeRefs into the existing base blob arenas; it writes no new blob bytes, and those base blobs were fsynced when their respective base snapshots were converted. Plumbing: - PosixReclaim.Fsync(int fd) — libc fsync(2), no-op on non-Linux for cross-platform test runs, throws IOException on errno. - ArenaFile.Fsync() / BlobArenaFile.Fsync() — thin wrappers. - ArenaReservation.Fsync() — forwards to the underlying arena file (the post-Complete handle the convert/compact caller already holds). - BlobArenaWriter.Fsync() — forwards to its file; asserts the writer has been Complete'd so the managed buffer is flushed first. Full Nethermind.State.Flat.Test -> 891/891 + 7 pre-existing skips. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…t durability barrier fdatasync skips the mtime/ctime flush that fsync would do but still flushes the file size — which is the only metadata change the auto-grown blob file produces and the only metadata our reads depend on (catalog RocksDB has its own durability barrier; we only need the arena/blob file's data + size to be on disk before the catalog records the entry). On ext4 the journal entry is smaller and the commit can occasionally be skipped when only mtime would have been dirty, giving ~10-30% savings per durability barrier. Drop-in: PosixReclaim.Fsync still surfaces as .Fsync() everywhere (ArenaFile, BlobArenaFile, ArenaReservation, BlobArenaWriter) — just the underlying syscall changed. Full Nethermind.State.Flat.Test -> 891/891 + 7 pre-existing skips. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…arker The auto-grow design (e5a3594) made FileInfo.Length the frontier source, which is unreliable under torn writes: a crash mid-FlushBuffer leaves bytes past the writer's last-committed frontier; restart treats those uncommitted bytes as valid data and readers parse trailing partial RLPs as garbage (or surface as RlpException). The earlier "restart-seals-each-file" symptom that originally motivated dropping pre-extension was actually caused by FileInfo.Length-based frontier restore being conflated with on-disk file size — separable concerns. Fix: 1. Restore pre-extension to MaxSize at file creation (sparse ftruncate). The blob writer never has to grow the file inline; disk blocks remain lazily allocated. 2. Reserve the first 4 bytes of every blob file as an on-disk frontier marker (int32 LE = absolute next-write offset). BlobArenaWriter.Complete publishes the new frontier into the marker; BlobArenaManager.Initialize reads it. Frontier recovery is decoupled from FileInfo.Length. 3. Reset becomes punch-hole + marker reset (NOT truncate). Punching [HeaderSize, prev_frontier) reclaims disk blocks while the file's logical length stays at MaxSize; the marker resets to HeaderSize so restart sees "no data". Marker reset is sequenced BEFORE the punch — a crash between the two leaves a consistent fresh-marker / stale-data state instead of a stale-marker / zeroed-data state. 4. Headroom check in Initialize uses Frontier (marker-derived), not FileInfo.Length (always MaxSize after pre-extension) — restored files now correctly re-enter the packing pool. NodeRef.RlpDataOffset semantics unchanged — still a file-absolute byte offset. Fresh-file offsets begin at HeaderSize=4 instead of 0. Readers use whatever offset the writer returned; no read-path change. BlobRange shifts automatically through blobWriter.StartOffset/Written. The IsShutdownPreserved guard at the top of TryResetOrphanedFrontier stays unchanged — without it, a preserve-flagged file's data range would still be punched on the last lease drop. Marker reset would also fire, leaving the file with a fresh marker pointing at zeroed data, which is strictly worse than the original sparse-hole symptom. Tests: - BlobFrontierReset_TruncatesFile_ForOrphanedRange rewritten as BlobFrontierReset_PunchesHoleAndResetsMarker_ForOrphanedRange: file length unchanged at MaxSize, marker reset to HeaderSize, disk blocks freed (StatBlocks). - LongFinalityIntegrationTests.Repository_Restart_PreservesAllData tightens blob-length assertion to == MaxSize. - New regression: Repository_Restart_IgnoresTornWritePastFrontierMarker writes a snapshot, fsyncs (marker durable), then manually appends 4 KiB of garbage past the marker, reloads, asserts the snapshot round-trips intact and Frontier matches the marker (not file length). Verified pre-fix this test fails (Frontier becomes len, garbage is interpreted as committed data). Full Nethermind.State.Flat.Test -> 892/892 + 7 pre-existing skips. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Resolved using-directives conflict in PersistenceManagerTests.cs (kept our System.Threading.Tasks for async TearDown; dropped the unused FluentAssertions import). - Migrated 7 branch-side test files from FluentAssertions to NUnit Assert.That style, matching master's branch-wide "Unify assertions" (ba5a917) cleanup which removed the FluentAssertions package: ArenaReclaimPunchHoleTests.cs ArenaMetricsTests.cs ArenaManagerEvictionQueueTests.cs ArenaBufferWriterReaderTests.cs PageResidencyTrackerTests.cs ArenaManagerForgetOnAdviseTests.cs Persistence/BloomFilter/BloomFilterTests.cs Mapping: .Should().Be(x) -> Assert.That(., Is.EqualTo(x)), .Should().Be{True,False,Null,Empty}() -> Assert.That(., Is.{True,False,Null,Empty}), .Should().Be{Greater,Less}Than(x) -> Assert.That(., Is.{GreaterThan,LessThan}(x)), .Should().HaveCount(n) -> Assert.That(., Has.Count.EqualTo(n)), .Should().Equal(a, b, c) -> Assert.That(., Is.EqualTo(new[] { a, b, c })), .Should().Contain(x) -> Assert.That(., Does.Contain(x)), .Should().Throw<E>() -> Assert.That(., Throws.TypeOf<E>()), .Should().ContainSingle().Which.Should().Be(x) -> Assert.That(., Is.EqualTo(new[] { x })). Full Nethermind.State.Flat.Test -> 892/892 + 7 pre-existing skips. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ersisted Reorders the four child edges explored from each BFS node: before: 0 in-mem compacted, 1 persisted compacted, 2 persisted base, 3 in-mem base after: 0 in-mem compacted, 1 in-mem base, 2 persisted compacted, 3 persisted base The previous order had the in-mem base last, breaking the "no-disk-read wins among ties in width" principle that the compacted-tier ordering already followed. Combined with the gating at line 90 (`if (currentPersisted && edgeIsInMemory) continue;`) the old order caused the BFS to commit to the persisted tier the first time both an in-mem and a persisted base existed for the same `to` — typically during a convert window — barring any wider in-mem compacted skip-pointer that might have existed downstream and produced a shorter overall bundle. New rule (now consistent): prefer in-RAM over disk; among ties prefer wider. This also keeps the BFS option-set open for the remainder of the traversal in the rare both-tiers-overlap case. No new tests — the existing 892-test suite passes unchanged, confirming no test relied on the old persisted-base-first order at a tier-overlap point. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ule offset The alignment value comes from _schedule.GetHierarchicalCompactSize which applies the per-instance offset (ShiftedAlignment: `(b + _offset) & -(b + _offset)`). The compaction trigger fires when this offset-shifted alignment hits the desired tier — i.e. the window the trigger selects is (snapshotTo - alignment, snapshotTo] in raw block-number space. The previous formula `((blockNumber - 1) / alignment) * alignment` is "largest multiple of alignment ≤ blockNumber - 1", which equals `blockNumber - alignment` ONLY when blockNumber mod alignment == 0 — i.e. when offset == 0. With a non-zero offset, blockNumber mod alignment is some non-zero value (call it k), so the formula gives: startingBlockNumber = blockNumber - k producing an output span of k blocks instead of the alignment-sized window the trigger actually selected. Observed in production logs: requestedCompactSize=128 at to=23448506 with a non-zero offset produced startingBlock=23448448 (span 58 instead of 128). Same shape for P10, P26 — all (block mod alignment) under the configured offset. The compacted output's actual block range is wrong, so the hierarchical compaction tier accumulates short windows that don't correspond to any real CompactSize-aligned coverage. Fix: replace with the correct window left edge. startingBlockNumber = blockNumber - alignment DoCompactPersistable's `blockNumber - _compactSize` was already correct — it never used the buggy formula — so this only affects the sub-CompactSize hierarchical-merge path. Full Nethermind.State.Flat.Test -> 892/892 + 7 pre-existing skips. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…fset bug Pins the fix in 718b2dc: with a non-zero schedule offset, the previous ((blockNumber - 1) / alignment) * alignment formula produced a window shorter than the alignment that triggered it. The new (blockNumber - alignment) gives the correct (snapshotTo - alignment, snapshotTo] window the offset-shifted trigger actually selects. Geometry: offset=3, CompactSize=64, maxCompactSize=32. At block 45, (45 + 3) & -(45 + 3) = 16 fires alignment=16. Window must be (29, 45] (span 16, From.BlockNumber=29). Pre-fix produces (32, 45] (span 13, From.BlockNumber=32) — verified by temporarily reverting the formula and confirming the new test fails with "Expected: 29, But was: 32" before re-applying the fix. Full Nethermind.State.Flat.Test -> 893/893 + 7 pre-existing skips. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…ontier marker" This reverts commit 9829572. Restores the auto-expanding blob design (e5a3594): - BlobArenaFile starts at length 0; FileStream.Write auto-extends. - No 4-byte on-disk frontier marker. - BlobArenaManager.Initialize restores frontier from FileInfo.Length. - TryResetOrphanedFrontier truncates to 0 (SetFileLength(0)) — no punch-hole, no PunchHole method on BlobArenaFile, no PunchHoleSupported on BlobArenaManager. - The IsShutdownPreserved guard at the top of TryResetOrphanedFrontier stays (added in 8a78319, predates the marker commit). - The torn-write regression test (Repository_Restart_IgnoresTornWrite- PastFrontierMarker) added by the marker commit is also removed. Conflict resolution: ArenaReclaimPunchHoleTests.cs's BlobFrontierReset test body merged using the parent's assertion logic but with the master merge's NUnit Assert.That style (FluentAssertions is gone branch-wide since ba5a917). Full Nethermind.State.Flat.Test -> 892/892 + 7 pre-existing skips. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

GatherReadOnlySnapshotBundle leased one PersistedSnapshotBloom per persisted snapshot keyed only by To. A registration race with compaction can leave a narrower bloom (e.g. a base bloom) at a wider compacted snapshot's To slot. The bundle uses the bloom as a negative read filter, so an under-covering bloom returns false for keys touched in the uncovered block range, silently skipping the snapshot and yielding stale/missing state. Add a LeaseOrSentinel(from, to) overload that only leases when the entry's From covers the requested range, falling back to AlwaysTrue otherwise, and call it from the bundle method. Mirrors the existing silent sentinel-on-miss behavior. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…nstructBloom LoadFromCatalog used to build a real bloom for every loaded snapshot inline — one HSST scan per catalog entry on every process start, even though most of those per-base blooms are immediately superseded by a wider compacted/persistable bloom that exists in the same catalog. Drop the inline build. After all snapshots are routed into their buckets, do one pass that walks the union of every bucket's To set newest→oldest, picks the widest snapshot at each uncovered slot (range-compare across all three buckets, nullable base for slots where only a compacted/persistable survived), builds its bloom, and registers it. Register's range walk then fills every covered slot in one shot; subsequent (lower-block) iterations skip via ContainsSlot. PersistedSnapshotBloomFilterManager.Register gains an optional parentLookup delegate so a compacted-on-unpopulated-slot registration (only reachable from ReconstructBloom — runtime callers don't pass it) can synthesize the chain entry by stepping back through the predecessor base StateId graph, instead of breaking the way the original "compacted on unpopulated key: stop" guard does for the runtime compaction path. Net effect: bloom build count on startup drops from "one per snapshot" to "one per maximal-covering snapshot in the graph"; manager end-state matches the runtime end-state after a long-running session's compactions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…s per state The SnapshotCatalog key was 40 bytes (To.BlockNumber + To.StateRoot), with the SnapshotKind and From only in the value. When base + persistable share a To (the persistable assembled from a base's window ends at the same block as the last base), the second Add overwrites the first in the underlying IDb; on restart only the last-written kind survives at that To. The bloom test added in the previous commit surfaced this by observing the lost base after reload. Extend the catalog key with the snapshot's depth (To.BlockNumber - From.BlockNumber) as 8 bytes big-endian — preserving the existing all-ascending lex order. The (To.BlockNumber, To.StateRoot, depth) triple is globally unique by construction: base has depth=1, persistable has depth=CompactSize, sub-CompactSize compacted has depth ∈ {2, 4, …, CompactSize/2} (depth=1 is excluded since a merge has ≥2 sources, depth=CompactSize is explicitly skipped in DoCompact), and hierarchical compacted has depth ∈ {2·CompactSize, …}. Catalog version bumps v6 → v7 — old directories trigger the existing wipe-and-resync error in Load(). Metadata key length 4 stays distinct from the new 48-byte entry key, so length-based discrimination in Load() is unchanged. Add(entry) derives depth internally so production call sites in ConvertSnapshotToPersistedSnapshot and AddCompactedSnapshot need no edit. The read/remove/update side (Remove, Find, UpdateLocation) gains a depth parameter because those callers don't carry a full CatalogEntry; PruneBucketBeforeLocked captures depth from the still-alive snapshot before Dispose, then passes it to RemoveFromCatalog so the catalog removal scopes to this bucket's entry alone. Tests: - SnapshotCatalog_SaveLoad_RoundTrips extended to write three entries at the same To with depths 1/2/4 (base / compacted / persistable) plus a tail entry; all four round-trip distinct via Find(to, depth) - SnapshotCatalog_Remove_And_Find / UpdateLocation updated to the new signatures; adds a case "remove one depth at a To leaves the sibling intact" - LoadFromCatalog_ReconstructsBloom_FromWidestCoveringSnapshot strengthened: base at the persistable's To now lease-succeeds after reload alongside the persistable - New LoadFromCatalog_RoundTripsBaseAndPersistableAtSameTo: 4 bases + 1 persistable in session 1, SnapshotCount==5 in session 2 (was 4 pre-fix) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…gress logging For a multi-thousand-entry catalog, startup was serial and silent: LoadFromCatalog opened an arena reservation + leased blob arenas + constructed a PersistedSnapshot per entry, then ReconstructBloom built (and registered) one bloom per picked snapshot. All on the loading thread, all under _catalogLock, no progress reported. Both phases have plenty of independent work and the underlying primitives are already concurrency-safe (ConcurrentDictionary buckets, lock-free ArenaManager.Open / BlobArenaManager.TryLeaseFile, CAS-based PersistedSnapshotBloomFilterManager.Register with TryAdd / TryUpdate retry). The only non-concurrent state is the three SortedSet<StateId> ordered ids plus _lastRegisteredState. Refactor: LoadFromCatalog - LoadSnapshot strips the SortedSet registration — it now only mutates the ConcurrentDictionary bucket and Interlocked metric counters and is safe under Parallel.ForEach. - New private LoadSnapshotsParallel runs the Parallel.ForEach with an optional ProgressLogger + System.Timers.Timer heartbeat (every 1s), gated on ParallelLoadThreshold = 1024 entries (and _logger.IsInfo). - After the parallel section, a serial post-pass walks the catalog-sorted entries and calls RegisterStateIdLocked into the right SortedSet — the ordering puts _lastRegisteredState on the highest registered StateId naturally, no ComputeLastRegisteredLocked() needed. ReconstructBloom - Phase A (serial, cheap): walk the union of every bucket's To ids newest→oldest, simulating Register's chain walk locally via a HashSet<StateId> coveredSlots set (SimulateRegisterFill mirrors Register's cur.BlockNumber > fromBlock exit and parentLookup-stepping so the simulated fill matches Register's actual fill, including the early termination when parentLookup returns default past a pruned gap). Collects picks in newest→oldest order. - Phase B: picks.Reverse() — the older end of allTos is where the wider persistables and hierarchical merges accumulate; putting them first in the parallel queue lets Parallel.ForEach's partitioner give the long-running bloom-builds to threads first (LPT scheduling) so a single big bloom doesn't dominate the tail. - Phase C (parallel): per pick, build the bloom via WholeReadSession + PersistedSnapshotBloomBuilder.Build, then Register. Picks have disjoint slot ranges by construction (Phase A's coveredSlots), so parallel Register calls touch different _blooms keys. Same ParallelLoadThreshold gates the progress logger + heartbeat. Constructor change PersistedSnapshotRepository ctor accepts an ILogManager (mirroring PersistedSnapshotCompactor's pattern). FlatWorldStateModule resolves it from the container; all test and benchmark call sites pass LimboLogs.Instance. Tests - New LoadFromCatalog_Parallel_PreservesOrderingAndDicts: 32 bases + 2 persistables (CompactSize=8) in session 1, reload in session 2 — asserts SnapshotCount, per-bucket TryLease*To for every entry, the From-chain walk via TryGetSnapshotFrom, and bloom-slot identity across a persistable's range. Stays below the threshold so the gate stays a one-line by-inspection branch. - All existing PersistedSnapshotRepository(...) call sites threaded through LimboLogs.Instance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Revert the demote-to-AlwaysTrue path: a demoted sub-CompactSize intermediate again keeps its real merged bloom (Demote only advises its pages cold). That optimization was too narrow — it freed only the sub-CompactSize intermediates' blooms and at the cost of all pre-filtering (the AlwaysTrue sentinel). Instead, when a >CompactSize merge produces a PersistedLargeCompacted snapshot over the canonical range (from, to], its merged bloom is a correct superset pre-filter for every persisted snapshot fully contained there. Walk that range block-by-block by parent across all four buckets and adopt the single big bloom into each contained snapshot, freeing its own (multi-MiB) bloom while still pre-filtering. Bounded by persistence lag, off the hot path. Sharing one bloom safely requires: - A separate RefCountedBloomFilter wrapper (SmallRefCountingDisposable) owns the lease count and the bloom-memory metric, leaving BloomFilter a pure data structure. The wrapped filter frees only once the big snapshot and every twin (and their in-flight readers) release it — a reader holding a snapshot lease keeps the wrapper, hence the filter, alive regardless of prune order. A shared filter is counted in the metric once. - Replacement goes through re-registration (a twin over the same reservation), not in-place mutation: MightContain is a lock-free check-then-use on the data pointer, so swapping+freeing a live snapshot's bloom is a use-after-free. The old instance keeps its own bloom until its refcount drains. The per-snapshot bloom is therefore immutable — always supplied to the ctor. The reload path no longer mutates it (SetBloom is gone); ReconstructBloom builds each snapshot's real bloom and re-registers an equivalent snapshot over the same reservation carrying it, replacing the AlwaysTrue placeholder. Snapshots extending below `from` are skipped (the big bloom is not a superset of their keys — sharing would cause false negatives). Pure live-memory optimization: blooms are not persisted, so reload rebuilds independent blooms. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ReconstructBloom built an independent bloom for every loaded snapshot, so a restart re-inflated bloom memory (and startup CPU) versus the running state — where, after a large compaction, contained snapshots share the wide skip-pointer's bloom and free their own. This restores the shared-bloom-on-load behavior removed by 0fd5067, now driven by the runtime ShareBloomAcrossRange. Reload now assembles the widest-first persisted chain with the main read-path AssembleSnapshots (from the head down to the committed base = the oldest loaded snapshot's From), then rebuilds a bloom only for each (widest) snapshot in that chain — in parallel — and shares each across its range. The contained narrower snapshots adopt it instead of carrying their own; snapshots no widest one covers keep the AlwaysTrue placeholder (correct, just unfiltered). The chain ranges are disjoint, so the parallel shares don't collide. Confined to the loader — no new repository/interface surface. ShareBloomAcrossRange blooms the widest (starting) snapshot itself: at `to` it re-registers the entry in every tier, and the placeholder bloom isn't the fresh one so it isn't skipped (unlike the runtime path, where the big snapshot already carries the bloom and is skipped via ReferenceEquals). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A large compaction adopts one bloom across the snapshots it contains, so a later compaction can assemble several sources pointing at the same filter, each reporting the whole window's key count. Summing per source inflated bloomCapacity (and the allocated merged filter) by the number of sharers. Dedup by bloom owner so a shared filter is counted exactly once. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add PersistedSnapshotBloomCount, incremented/decremented in the RefCountedBloomFilter ctor/CleanUp alongside the existing bloom-memory accounting. Keyed to the wrapper, a bloom shared across snapshots counts once, so reading it against the active-snapshot count measures large-compaction bloom-sharing effectiveness. Marked [DetailedMetric]. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ted table Split the HSST on-disk format out of flat/long-finality: persisted-snapshot metadata is now one deliberately-unoptimized, single-level binary-search SortedTable instead of the nested columnar HSST. Trie-node RLP still lives in blob arenas as NodeRefs. - Add Sorted/{SortedTable,SortedTableBuilder,SortedTableReader, SortedTableEnumerator} + FORMAT.md. Interleaved [ks][key][vs][value] records, a u32 offset index, and a footer; the builder buffers all keys and sorts the offsets at Build, the reader binary searches. - PersistedSnapshotKey materializes verbose <column><addr><subcolumn><slot> keys with the column/subcolumn tag bytes stored as 255-tag, so a plain ascending sort reproduces the HSST reverse-tag emission order. The builder/compacter ordering and comparers are unchanged for a future proper-HSST swap. - Rewrite builder/reader/merger/scanner and PersistedSnapshot over the sorted table; drop AddressBoundCache and the BTree-root caching. The merger keeps newest-wins plus per-address self-destruct truncation. - Remove all HSST format code (BTree/PackedArray/DenseByteIndex/TwoByteSlot, dispatchers, merge plumbing); keep the seam interfaces, the Storage arena layer, bloom, and orchestration. - Bump SnapshotCatalog version 1->2 (old HSST dirs require wipe-and-resync). - Delete the HSST format unit tests; add SortedTableTests. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

claude · 2026-06-23T07:20:19Z

Claude finished @asdacap's task in 10m 4s —— View job

PR Review — `refactor(flat): replace persisted-snapshot HSST with single-level sorted table`

Gather context and understand the PR
Review core sorted table format implementation
Review persisted snapshot builder/reader/merger/scanner
Review storage layer (arena, blob arena, catalog)
Review key encoding and ordering correctness
Review resource management and dispose patterns
Review tests
Post inline comments and final summary

Summary

The PR cleanly replaces the nested HSST columnar format with a simpler single-level sorted table backed by a binary-search reader, and the overall design is sound. The new SortedTable wire format is well-specified and the FORMAT.md is a good addition. The arena/blob-arena storage layer is sophisticated but well-structured. However, four High issues need to be resolved before merge:

Critical: 0 | High: 4 | Medium: 5 | Low: 4

High

H1 — SmallRefCountingDisposable: resurrection race window (line 41)

Between the CAS that drives _leases from 1 → 0 and the second CAS that drives 0 → Disposing, a concurrent TryAcquireLease sees _leases == 0 (NoAccessors), passes the == Disposing guard, and successfully CASes 0 → 1. The caller holds a live lease on an object that has logically hit zero count. Fix: also treat NoAccessors as a rejection in TryAcquireLease:

if (current is Disposing or NoAccessors) return false;

H2 — PersistedSnapshotMerger: TryRead return ignored; 0x00 default treated as Destructed (line 246)

SelfDestructDestructedMarkerByte = 0x00 and flag is initialised to 0. If TryRead returns false (corrupt value bound, seek failure), flag stays 0 = DestructedMarkerByte, so barrier = i is set. A re-created contract's NewMarker (0x01) entry failing to read silently triggers the self-destruct barrier, dropping that address's slots in the compacted snapshot.

H3 — PersistedSnapshotMerger: merged self-destruct tag is wrong for re-created contracts (line 250)

When a compaction range spans a destruction (source A, Destructed) followed by a re-creation (source B > A, New), the loop correctly sets barrier = A and keeps source-B's slots, but the emitted tag is Destructed (because barrier >= 0) rather than New. Stack consumers then see "contract destroyed" and refuse to serve the re-created contract's state. The emitted tag should be taken from the newest matching source, not from any historical destruction.

H4 — PersistedSnapshotBuilder: TryGet* return ignored, null-forgiving dereference (line 274)

snapshot.TryGetStateNode(path, out TrieNode? node) and snapshot.TryGetStorageNode(...) return values are discarded, then node!.FullRlp is dereferenced. Keys are pre-extracted with IsPersisted = true, but a concurrent prune or eviction between extraction and the write pass can null the node, producing an unguarded NullReferenceException that silently aborts the snapshot build.

Medium

M1 — ArrayPool rental not in finally (line 261)
rlpBuffer is rented at line 202 and returned at line 261. Any exception inside the per-address loop leaks the rental. Wrap in try/finally.

M2 — SortedTable.TryReadFooter: signed overflow in n * OffsetSize (line 62)
For a crafted/corrupt footer with n near long.MaxValue / 4, n * OffsetSize silently overflows to a negative value; the > table.Length guard then passes, corrupting offsetRegionStart. Pre-validate n <= (table.Length - FooterSize) / OffsetSize before multiplying.

M3 — MergeEntries: stackalloc int[n] with runtime-variable n (line 105)
Unbounded stackalloc risks StackOverflowException for large compaction batches. Add an upper-bound guard or use a fixed stack size with heap fallback.

M4 — Leb128: signed/unsigned mismatch (file)
Method signatures accept long but encoding is unsigned LEB128. Write(-1) encodes 10 bytes; Read returns long.MaxValue. Either change to ulong or document the non-negative constraint explicitly.

M5 — PersistedSnapshotCompactor.EnsureStarted not thread-safe
The _compactPersistedTask ??= ... and _boundaryCompactorTasks is null checks are unguarded. Two concurrent callers could spawn duplicate worker sets and race to consume the bounded channel. Should be protected by the existing _lock or Lazy<T>.

Low

L1 — ArenaReservation.CleanUp: wasted FadviseDontNeed when file is already dead (line 192)
FadviseDontNeed fires even when fileSurvives == false (file is about to be deleted). Gate it on fileSurvives.

L2 — SortedTableBuilder.Build: raw pointer captured before Sort (line 70)
keyBase is snapshotted from _keyBuf before _entries.Sort. Safe today because _keyBuf is never written during the sort, but there is no comment asserting this invariant. Add one to prevent a subtle use-after-realloc in future.

L3 — SortedTableEnumerator: footer failure is silent
When TryReadFooter returns false, the enumerator behaves as empty with no way for callers to distinguish "empty table" from "corrupt table". An IsValid property or a thrown exception on first MoveNext would make misuse visible.

L4 — SortedTableEnumerator: key buffer grows to exact fit
if (keyLength > _keyBuf.Length) _keyBuf = new byte[keyLength] — allocates on every growth. Use Math.Max(_keyBuf.Length * 2, keyLength) to amortize.

Design note

The 255 − tag encoding used for column/subcolumn bytes in PersistedSnapshotKey is correct and produces the desired sort order. The inline comment at SortedTable that explains the sort → HSST compatibility story is clear and helpful for future reviewers.

claude · 2026-06-23T07:27:57Z

+    {
+        // Volatile read for starting value
+        long current = Volatile.Read(ref _leases);
+        if (current == Disposing)


High — TryAcquireLease can succeed on an object mid-disposal

Between the CAS that atomically moves _leases from Single (1) → 0 and the second CAS that moves 0 → Disposing (-1), a concurrent TryAcquireLease sees _leases == 0, which is NoAccessors — not Disposing — so the guard at line 41 passes and the CAS on line 49 (0 → 1) succeeds. The caller gets a live lease on an object that should be dead.

The object is cleaned up eventually (when the re-acquirer releases), but in between it passes TryAcquireLease-based guards that callers rely on to detect a doomed file/arena.

Fix: also reject NoAccessors in the initial check:

Suggested change

if (current == Disposing)

if (current is Disposing or NoAccessors)

This makes TryAcquireLease return false as soon as the count hits zero, closing the resurrection window. The inner-loop retry already handles the Disposing case, so this only adds the 0 guard to the fast-path.

claude · 2026-06-23T07:28:07Z

+            int i = matching[k];
+            byte flag = 0;
+            TReader r = views[i].CreateReader();
+            r.TryRead(enums[i].CurrentValue.Offset, new Span<byte>(ref flag));


High — TryRead return value ignored; failure silently masquerades as Destructed

SelfDestructDestructedMarkerByte = 0x00 and flag is initialised to 0. If TryRead returns false (e.g. a corrupted value bound), flag remains 0, which equals DestructedMarkerByte, so barrier = i is set. A corrupt/missing self-destruct entry for a re-created contract (which would correctly hold NewMarker = 0x01) is silently treated as a destruction, causing the merger to drop post-creation slots.

if (!r.TryRead(enums[i].CurrentValue.Offset, new Span<byte>(ref flag))) continue; // skip unreadable entries if (flag == PersistedSnapshotTags.SelfDestructDestructedMarkerByte) barrier = i;

claude · 2026-06-23T07:28:19Z

+            if (flag == PersistedSnapshotTags.SelfDestructDestructedMarkerByte) barrier = i; // newest destructed
+        }
+
+        table.Add(key, barrier >= 0


High — Merged self-destruct tag is wrong when a newer source re-creates the contract

The loop sets barrier to the newest source that wrote DestructedMarker, but the emitted tag uses barrier >= 0 to decide whether to write Destructed. If sources cover blocks where a contract was destroyed (source A) and then re-created (source B, index > A), the loop leaves barrier = A (the New byte 0x01 != 0x00 does not update barrier), and the emitted tag is Destructed even though the newest version is New.

Consumers of the compacted snapshot (e.g. PersistedSnapshotStack.TryGetSelfDestruct) will report the contract as destroyed and refuse to serve its storage slots, causing incorrect "contract doesn't exist" responses for a live re-created contract.

Fix: the emitted tag should be taken from the newest matching source rather than from the presence of any historical destruction:

// newest source wins for the written tag; barrier still controls slot filtering byte newestFlag = 0; r = views[matching[^1]].CreateReader(); r.TryRead(enums[matching[^1]].CurrentValue.Offset, new Span<byte>(ref newestFlag)); table.Add(key, newestFlag == PersistedSnapshotTags.SelfDestructDestructedMarkerByte ? PersistedSnapshotTags.SelfDestructDestructedMarker : PersistedSnapshotTags.SelfDestructNewMarker);

claude · 2026-06-23T07:28:29Z

+        {
+            TreePath path = keys[i];
+            snapshot.TryGetStateNode(path, out TrieNode? node);
+            NodeRef nr = blobWriter.WriteRlp(node!.FullRlp.AsSpan());


High — TryGetStateNode return value ignored; node! dereference will NRE if node was evicted

Keys are extracted from snapshot.StateNodes in the parallel lambda (line 87–95), where IsPersisted = true is set to signal that these nodes should not be evicted before the write pass. However if the IsPersisted flag race or a concurrent prune between the extraction and this write pass removes the node, TryGetStateNode returns false and node is null. The ! null-forgiving operator then throws NullReferenceException, crashing the persisted-snapshot build silently. Same pattern at line 300–301 for WriteStorageNodes.

At minimum, guard and throw a descriptive exception:

if (!snapshot.TryGetStateNode(path, out TrieNode? node) || node is null) throw new InvalidOperationException($"State node at {path} was evicted between extraction and persist.");

claude · 2026-06-23T07:28:41Z

+            }
+        }
+
+        ArrayPool<byte>.Shared.Return(rlpBuffer);


Medium — ArrayPool rental not in a finally block; leaks on any exception in the per-address loop

rlpBuffer is rented at line 202 and returned here. Any exception thrown inside the for loop (e.g. from table.Add, Rlp.Encode, or bloom operations) skips this Return, leaking the rental until the next GC:

byte[] rlpBuffer = ArrayPool<byte>.Shared.Rent(256); try { // ... for loop ... } finally { ArrayPool<byte>.Shared.Return(rlpBuffer); }

claude · 2026-06-23T07:28:50Z

+        long n = BinaryPrimitives.ReadInt64LittleEndian(footer);
+        if (n < 0) return false;
+
+        long offsetRegionLength = n * OffsetSize;


Medium — Signed overflow in n * OffsetSize before the bounds guard for adversarial inputs

n is read directly from the on-disk footer as an i64. If a crafted or corrupted file provides n near long.MaxValue / 4, the multiplication n * OffsetSize silently overflows to a negative value. The guard on line 63 would then pass (negative + FooterSize < table.Length), and offsetRegionStart would be computed from the wrong (negative) offset length. Pre-validate n before multiplying:

// Guard before multiplying to prevent signed overflow on adversarial input if (n > (table.Length - FooterSize) / OffsetSize) return false; long offsetRegionLength = n * OffsetSize;

claude · 2026-06-23T07:28:57Z

+        int barrier = -1;
+
+        Span<byte> minKey = stackalloc byte[PersistedSnapshotKey.MaxKeyLength];
+        Span<int> matching = stackalloc int[n];


Medium — stackalloc int[n] with runtime-variable, unbounded n

n = views.Length is the number of snapshot inputs. In practice this is small (typically 2–16), but there is no upper-bound guard. A large compaction batch (or a fuzzer passing many inputs) would cause StackOverflowException. A safe guard or heap fallback for large N:

Span<int> matching = n <= 64 ? stackalloc int[64] : new int[n]; matching = matching[..n];

claude · 2026-06-23T07:29:08Z

+{
+    [MethodImpl(MethodImplOptions.AggressiveInlining)]
+    public static long Read(ReadOnlySpan<byte> data, ref int offset)
+    {


Medium — Signed/unsigned mismatch: method signatures say long but encoding is unsigned LEB128

Write casts value to ulong before encoding, so negative long values are encoded as their two's-complement bit pattern (10 bytes of 0xFF/0x01). Read performs unsigned decode with no sign-extension, so Read(Write(-1)) returns long.MaxValue (not -1). Any caller passing a negative long will silently get a corrupt round-trip.

Either change both signatures to ulong, or add signed LEB128 support (sign-extend the final group's high bit). At a minimum, document the constraint in the XML summary:

/// <summary>Reads an unsigned LEB128-encoded value. Only non-negative <see cref="long"/> values /// are supported; the encoding is the two's-complement unsigned representation.</summary>

claude · 2026-06-23T07:29:18Z

+        // skipped.
+        bool preserve = Volatile.Read(ref _preserveOnDispose) == 1;
+        bool punched = !preserve && fileSurvives && _arenaManager.TryPunchHole(_arenaFile, Offset, footprint);
+        if (!punched)


Low — Wasted FadviseDontNeed syscall when the arena file is already dead

When fileSurvives == false, MarkDead has released the manager's lease on ArenaFile. The ArenaReservation still holds its own lease (so the handle remains open), but the file is about to be File.Deleted on the last-lease release at line 197. The fadvise call here is therefore wasted — the kernel will drop those pages when the file is unlinked anyway. Guard it on fileSurvives:

Suggested change

if (!punched)

if (!punched && fileSurvives)

_arenaFile.FadviseDontNeed(Offset, footprint);

claude · 2026-06-23T07:29:39Z

+        Span<Entry> entries = _entries.AsSpan();
+        if (entries.Length > 0)
+        {
+            byte* keyBase = (byte*)Unsafe.AsPointer(ref MemoryMarshal.GetReference(_keyBuf.AsSpan()));


Low — Raw pointer captured before Sort; safe today but latent footgun

keyBase is a byte* into _keyBuf's backing store, snapshotted before _entries.Sort(...). NativeMemoryList<byte> never grows during Sort (sorting is done on _entries, not _keyBuf), so the pointer is stable. However, this two-step pattern — capture pointer, then mutate the sibling list — is fragile: a future maintainer could trigger a _keyBuf.Add inside the comparator (e.g. for diagnostics) and create a dangling pointer. A comment asserting the invariant would prevent this:

// _keyBuf is not touched during sort — only _entries is reordered. // keyBase remains valid for the lifetime of the sort. byte* keyBase = (byte*)Unsafe.AsPointer(ref MemoryMarshal.GetReference(_keyBuf.AsSpan()));

Shrink the SortedTable index and build bookkeeping ~8×: - Store one offset per 8-record block instead of one per record. Records are now written physically sorted and contiguous, so a lookup binary searches the sparse offsets to a block, then sequentially scans its <=8 records (almost always within one page). The enumerator becomes a straight contiguous walk. - Key and value sizes are a single byte each (keys are <=55 B; all inline values are <255 B, enforced by the builder's checked cast). - Referenced blob-arena ids become one record each (key [0x00][id BE], value [0x01]) in a column that sorts before all others — so they dedup into the union through the normal N-way merge and iterate cheaply from the table start, with no list value and no chunking. Drops MergeRefIds and the ref_ids metadata entry. - Footer carries blockSize; SortedTable format version 1->2; SnapshotCatalog version 2->3 (incompatible on-disk layout, dev DBs wipe-and-resync). Full Nethermind.slnx builds; Nethermind.State.Flat.Test is green, including new block-boundary SortedTableTests cases and the compaction/ref-id round-trips. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Each record now stores `[commonPrefix u8][suffixLen u8][keySuffix][vs u8][value]`: only the bytes that differ from the previous key, plus the shared-prefix length. Adjacent sorted keys share long prefixes (per-address slots share `FE<addr>FD` + ~30 slot bytes, storage/state nodes share their column/addrHash prefix, etc.), so this is a large on-disk saving for the dominant key classes. Block-start records (every 8th, reached via the sparse offset) force commonPrefix=0 / full key, so each block decodes standalone. The reader binary-searches block-start full keys then reconstructs in-block keys with a running buffer; the enumerator reconstructs during its contiguous walk. Builder `Add`, the merger, the scanner, and `PersistedSnapshot` are unchanged — keys flow in/out uncompressed, compression is internal to the builder write pass / reader / enumerator. SortedTable format version 2->3; SnapshotCatalog version 3->4 (incompatible record layout, dev DBs wipe-and-resync). Adds a long-shared-prefix SortedTableTests case; full Nethermind.State.Flat.Test stays green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The last caller was removed in 71a2da5 (collapse TwoByteSlot u16/u24 fork), leaving Leb128 as dead code with no remaining references. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…parator index Replace the single-level sparse-offset SortedTable with a LevelDB-style two-level layout: size-bounded 4KB data blocks, each prefixed with a u16 restart table for in-block binary search, plus a tail separator-key index that the first-level lower-bound search operates on. Keeps the first-level search off the scattered data pages and bounds blocks by size rather than a fixed record count. Breaking on-disk change: SortedTable format v3->v4 and catalog v4->v5 (clean break — wipe and resync). The Add/Build/TrySeek/enumerator surface is unchanged; only the internal footer/TryReadFooter shape moved. Extends SortedTableTests (multi-block, restart boundaries, gap/sentinel misses, multi-block enumeration, randomized fuzz). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…rtedTable Extract the per-block encode/decode into a reusable BlockBuilder/BlockReader used for BOTH the data blocks and the top-level index (key = separator, value = u32 block number). Data blocks are 4 KiB-aligned and addressed by block number (byte = blockNumber * 4096), so a u32 reaches a 16 TiB table; the last data block is left unpadded so small tables stay compact (the footer's lastBlockSize locates the index). Each block self-describes its offset width (2 or 4 bytes) via a leading flag, so 4 KiB data blocks use u16 offsets while the multi-MB index block uses u32. Lookups are two ceiling searches (index -> block number -> data block + exact match). Restart interval is now a builder option, default 8 (was 16). Breaking on-disk change: SortedTable format v4->v5 and catalog v5->v6 (clean break - wipe and resync). Add BlockTests (offset-width selection + ceiling edge cases) and extend SortedTableTests (4 KiB alignment, block-number addressing, interval-8 restart boundaries). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- SmallRefCountingDisposable: reject a lease acquisition once the count reaches zero (not only Disposing), checked inside the retry loop too — closes the resurrection window between the release path's 1→0 and 0→Disposing CAS. [H1] - PersistedSnapshotMerger.MergeSelfDestruct: skip self-destruct entries whose TryRead fails rather than letting the zero default read as the destructed marker and set a spurious truncation barrier. [H2] - PersistedSnapshotBuilder: throw a descriptive error if a state/storage node disappears between extraction and the write pass, instead of an NRE on node!. [H4] - PersistedSnapshotBuilder.WritePerAddress: return the pooled RLP buffer in a finally so it is not leaked on an exception inside the loop. [M1] - SortedTable.TryReadFooter: bound numBlocks by the table size before the int cast / offset math so a corrupt footer cannot overflow to a negative count. [M2] - PersistedSnapshotMerger.MergeEntries: cap the matching-sources stackalloc and fall back to the heap for an unusually large compaction batch. [M3] - PersistedSnapshotCompactor.EnsureStarted: guard worker startup with a lock so concurrent EnqueueAsync callers cannot spawn duplicate worker sets. [M5] - ArenaReservation.CleanUp: skip the wasted FadviseDontNeed when the file is about to be deleted. [L1] - SortedTableBuilder: document the recordBase-stable-during-sort invariant. [L2] Left intentionally: the merged self-destruct tag (review H3) stays "destructed if any source in the range destructed". Its only value-consumer, PersistenceManager, does `if (SelfDestructFlag is false) batch.SelfDestruct(addr)` before re-applying the barrier-filtered post-destruct slots, so the destruct must be reported to clear stale RocksDB storage for a re-created contract; emitting "new" would leak it. The read path keys off the barrier (presence), not the value. Rationale documented on MergeSelfDestruct. Moot after later refactors: M4 (Leb128 deleted), L4 (enumerator key buffer is now fixed-size). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The persisted-snapshot metadata blob (a SortedTable) could not exceed ~2 GiB because SortedTableBuilder buffered every record in an int-capped NativeMemoryList and sorted at Build. Replace that with a streaming, order-enforcing builder: Add now requires strictly ascending keys and writes each record straight into a 4 KiB data block, so the table size is bounded by the 16 TiB data region instead of an in-memory buffer. The single index block is no longer recomputed from the (int) block count; it is located directly by a new i64 indexOffset in the footer and left unaligned. Footer widened to i64 fields (count, numDataBlocks, indexOffset); reader/enumerator widened int -> long. No multi-level/B-tree index. Producers now emit in global ascending key order (the builder enforces it): - PersistedSnapshotBuilder: ascending column order, storage nodes via a 3-way merge of the fallback/compact/top sublists (path encodings are order-preserving, so list order matches encoded-byte order). - PersistedSnapshotMerger: entries before metadata; metadata in name order; buffered slots flushed at the slot -> self-destruct/account transition so they land in their sorted position (MergeSelfDestruct split into ComputeSelfDestructBarrier + EmitSelfDestruct). EstimateSize drops the 2 GiB Math.Min cap (EstimateMemory bounds the metadata table, which stores only 6-byte NodeRefs, not node RLP). Format break: SortedTable.FormatVersion 5->6, catalog v6->v7, MetadataFormatVersion 0x05->0x06; old blobs are rejected (wipe and resync). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Addresses review comments on IHsstByteReader / IHsstReaderSource ("who uses this?"). They are still the core read seam, but the HSST format was removed earlier in this PR, so the Hsst/ folder now holds only generic byte-IO seams. - IHsstByteReader -> IByteReader, IHsstReaderSource -> IByteReaderSource. - Move the seam types out of the misnamed Hsst/ folder: namespace + folder Nethermind.State.Flat.Hsst -> Nethermind.State.Flat.Io (and the test counterpart Test/Hsst -> Test/Io). ~22 usings updated. - Purge the remaining "HSST" mentions from comments/docs (rephrased to "columnar" / "sorted-table"), and fix stale references to removed types (HsstReader cref, HsstBTreeBuilder, NWayMergePerAddressHsst, LeaseBlobIdsFromHsst). Pure rename + comment cleanup, no behavior change. Full Nethermind.slnx builds; Nethermind.State.Flat.Test green (760 passed, 4 skipped). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

asdacap and others added 30 commits May 27, 2026 20:14

Merge branch 'feat-nway-snapshot-merge' into flat/long-finality

a92714e

Merge branch 'cache-btree-root-index' into flat/long-finality

5a185f1

Merge branch 'verify-snapshot-bloom-coverage' into flat/long-finality

19fc543

asdacap and others added 5 commits June 19, 2026 16:08

github-actions Bot added BREAKING database refactoring state+storage trie labels Jun 23, 2026

claude Bot reviewed Jun 23, 2026

View reviewed changes

asdacap and others added 9 commits June 23, 2026 16:05

refactor: remove unused Leb128 helper

4683f92

The last caller was removed in 71a2da5 (collapse TwoByteSlot u16/u24 fork), leaving Leb128 as dead code with no remaining references. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge branch 'investigate-leb128-usage' into simplify-snapshot-format

a508dd6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(flat): replace persisted-snapshot HSST with single-level sorted table#12100

refactor(flat): replace persisted-snapshot HSST with single-level sorted table#12100
asdacap wants to merge 777 commits into
masterfrom
simplify-snapshot-format

asdacap commented Jun 23, 2026

Uh oh!

claude Bot commented Jun 23, 2026 •

edited

Loading

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

claude Bot Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

	if (current == Disposing)
	if (current is Disposing or NoAccessors)

	if (!punched)
	if (!punched && fileSurvives)
	_arenaFile.FadviseDontNeed(Offset, footprint);

Conversation

asdacap commented Jun 23, 2026

Changes

Types of changes

What types of changes does your code introduce?

Testing

Requires testing

If yes, did you write tests?

Notes on testing

Documentation

Requires documentation update

Requires explanation in Release Notes

Remarks

Uh oh!

claude Bot commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Review — refactor(flat): replace persisted-snapshot HSST with single-level sorted table

Summary

Critical: 0 | High: 4 | Medium: 5 | Low: 4

High

Medium

Low

Design note

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 23, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude Bot commented Jun 23, 2026 •

edited

Loading

PR Review — `refactor(flat): replace persisted-snapshot HSST with single-level sorted table`