Skip to content

refactor(flat): replace persisted-snapshot HSST with single-level sorted table#12100

Draft
asdacap wants to merge 777 commits into
masterfrom
simplify-snapshot-format
Draft

refactor(flat): replace persisted-snapshot HSST with single-level sorted table#12100
asdacap wants to merge 777 commits into
masterfrom
simplify-snapshot-format

Conversation

@asdacap

@asdacap asdacap commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Part of splitting the large flat/long-finality work into reviewable pieces. This PR isolates the persisted-snapshot on-disk format: it removes the nested columnar HSST and replaces it with a single, deliberately-unoptimized, binary-search sorted table. Trie-node RLP still lives in blob arenas as NodeRefs; only the metadata format changes.

Changes

  • New format (PersistedSnapshots/Sorted/): SortedTable / SortedTableBuilder / SortedTableReader / SortedTableEnumerator + FORMAT.md. Interleaved [ks u16][key][vs u16][value] records, a u32 offset index, and a footer. The builder buffers all keys and sorts the offsets at Build; the reader binary searches (O(log N), no caching).
  • Verbose keys (PersistedSnapshotKey): fully-materialized <column><addr><subcolumn><slot> keys with the column/subcolumn tag bytes stored as 255 − tag, so a plain ascending byte sort reproduces the HSST reverse-tag emission order. The builder/compacter ordering and comparers are kept unchanged so a future proper-HSST serializer can reuse them.
  • Rewired PersistedSnapshotBuilder / PersistedSnapshotReader / PersistedSnapshotMerger / PersistedSnapshotScanner / PersistedSnapshot over the sorted table. Dropped AddressBoundCache and the address-BTree-root caching. The merger preserves newest-wins plus per-address self-destruct truncation.
  • Removed all HSST format code (Hsst/{BTree,PackedArray,DenseByteIndex,TwoByteSlot}, the dispatchers, the N-way merge plumbing). Kept the seam interfaces (IHsstByteReader/Bound, IByteBufferWriter, IHsstReaderSource, SpanByteReader, PooledByteBufferWriter), the Storage/ arena layer, bloom, and orchestration.
  • Catalog version bumped 1 → 2: existing long-finality persisted-snapshot directories are unreadable by the new reader and fail fast with a "wipe and resync" message.
  • Deleted the HSST format unit tests; added SortedTableTests.

Types of changes

What types of changes does your code introduce?

  • Breaking change (a change that causes existing functionality not to work as expected)
  • Refactoring

Breaking only for the experimental EnableLongFinality on-disk format (off by default): the persisted_snapshot directory must be wiped.

Testing

Requires testing

  • Yes

If yes, did you write tests?

  • Yes

Notes on testing

Full Nethermind.slnx builds. Nethermind.State.Flat.Test is green — 729 existing tests (4 pre-existing skips) plus 4 new SortedTableTests — including the compactor tests that cover newest-wins and per-address self-destruct truncation.

Documentation

Requires documentation update

  • No

Format documented in-repo at PersistedSnapshots/Sorted/FORMAT.md.

Requires explanation in Release Notes

  • No

Experimental feature, off by default.

Remarks

Intermediate step: the format is intentionally unoptimized (a full record per slot, buffer-all-keys at build). Proper HSST returns in a later PR, reusing the unchanged builder/compacter ordering.

🤖 Generated with Claude Code

asdacap and others added 30 commits May 27, 2026 20:14
…NWayMergeCursor

The per-slot HsstEnumerator was previously a field on each IHsstMergeSource
implementation, retrieved via GetEnumerator(). That smeared two concerns —
"what bytes to merge" (view) and "where in them" (enumerator + bound) — onto
the source struct, and forced every source-construction path
(FromView / MapCursorSource via WithEnumerator / IHsstEnumeratorFactory) to
handle the HSST layout dispatch (tail-byte vs front-byte two-byte-slot).

Reshape:

  - IHsstMergeSource drops GetEnumerator() and IDisposable, gains a `Bound`
    property. The source is now a true minimal "what to merge" pair —
    reader factory + bound.
  - WholeReadSessionMergeSource stores (view, bound); `WithEnumerator` is
    replaced by `WithBound`. FromView resolves the column-tag bound and
    returns a (view, bound) pair, no enumerator construction.
  - IHsstEnumeratorFactory moves out of PersistedSnapshotMerger to a new
    file in the Hsst namespace, generic over <TReader, TPin>.
  - NWayMergeCursor gains a TFactory generic parameter and constructs one
    HsstEnumerator per slot in its ctor, storing them in a caller-supplied
    `Span<HsstEnumerator<TReader, TPin>>`. `MinValue` and `ValueAt` read
    the enumerator directly instead of going through the source.
  - MapCursorSource collapses to a "rewrite the bound" loop — no factory
    parameter, no per-source enumerator construction.

Every cursor consumer (HsstBTreeMerger.NWayMerge / NWayMergeKeyFirst,
HsstPackedArrayMerger.NWayMerge, HsstTwoByteSlotMerger.NWayMerge,
IHsstBTreeValueMerger.OnFastCopy / MergeValues) gains a TFactory generic
parameter for naming the cursor type; the underlying behaviour is
unchanged. Concrete value-merger structs (PerAddressColumnValueMerger,
StorageTrieColumnValueMerger, SlotPrefixValueMerger) commit to
TFactory = TailDispatchEnumeratorFactory in their interface impls.

Every cursor construction site now rents a parallel
ArrayPoolList<HsstEnumerator<…>> alongside its sources buffer; the per-
source Dispose loops are removed (sources no longer carry disposable
enumerators, and HsstEnumerator's Dispose is a no-op anyway).
SlotPrefixValueMergerScratch gains an InnerEnumerators ArrayPoolList for
the inner 2-byte slot-suffix merge's enumerator buffer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Bound

FromView now wraps a view in a source over the entirety of its bytes — a
single-line static factory with no HSST-layer logic. The column-tag scope
resolution moves to a dedicated ResolveColumnBound(view, columnTag) helper
in PersistedSnapshotMerger that returns a Bound. The five seeding loops in
NWayMergeSnapshotsWithViews now use the source's primary constructor with
the pre-resolved bound:

    columnSources[i] = new(views[i], ResolveColumnBound(views[i], TAG));

Same behaviour, clearer separation: FromView is just "view → source", and
column-tag seeking is a single named operation rather than a hidden step
inside the source factory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…uildMergeCursor

After the recent FromView simplification (no more columnTag parameter),
every actual caller migrated to the primary constructor + ResolveColumnBound,
leaving FromView with zero use sites — just a stale doc-cref. Delete it.

MapCursorSource had a single caller (BuildMergeCursor's body) and only
existed because the previous design separated source-mapping from cursor
construction. Fold its three-line for-loop directly into BuildMergeCursor.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Eight NativeMemoryListRef rentals in PersistedSnapshotMerger held bounded
Span<int>/Span<Bound> scratch (sized by matchCount or slotCapacity, both
≤ snapshot count N ≤ ~32). Converted to stackalloc:

  - PerAddressColumnValueMerger.MergeValues: perAddrBounds + subTagBounds
  - PerAddressColumnValueMerger.MergeSlots: slotSources + slotBounds
  - StorageTrieColumnValueMerger.MergeValues: perAddrBounds + subTagBounds
  - StorageTrieColumnValueMerger.MergeStorageSubTag: srcs + subBounds

Bound is `record struct(long Offset, long Length)` — unmanaged, so
stackalloc Bound[n] compiles. Per-call stack budget is well under 1 KiB
total across all four stackallocs in any single call frame.

The two NativeMemoryListRef rentals in NWayMetadataMerge stay — their
sizes (totalRefIdsBytes) are unbounded and could plausibly exceed 2 GiB,
which the file comment already calls out.

Adding `scoped` to the relevant Span<Bound>/ReadOnlySpan<Bound> parameter
on ResolvePerAddrAndSubTagBounds and the per-sub-tag method signatures
(MergeSlots/MergeSelfDestruct/MergeAccount/MergeStorageSubTag) is required
so the compiler accepts passing stackalloc spans alongside the scoped ref
HsstDenseByteIndexBuilder parameter.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…opy/MergeValues

Only one of three value-merger implementations had a non-empty OnKey body
(PerAddressColumnValueMerger: bloom.Add(addrKey)); the other two were
literally `{ }`. Removing OnKey from the interface and the driver loop
shrinks both ends — the per-address merger gets a single `bloom.Add(addrKey)`
at the start of OnFastCopy and MergeValues; the other two impls lose their
no-op bodies; HsstBTreeMerger's NWayMerge / NWayMergeKeyFirst lose the
trailing `valueMerger.OnKey(cursor.MinKey)` dispatch on every emitted key.

IHsstPackedArrayMergeCallback.OnKey and IHsstTwoByteSlotMergeCallback.OnKey
stay — they're the only callbacks those simpler mergers expose, and their
implementations all do non-trivial per-key work.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…reeNodeLayoutPlanner

Both parameters of PlanFromProfile were write-only: callers passed values
that were never read by any code path, only re-assigned by the slot-widening
block. Removing them lets Plan skip the secondLen/allSameLenExceptFirst
bookkeeping in its lengths-scan loop.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…BTreeBuilder.BuildIndex

BuildIndex no longer runs a leaf phase: EmitInlineLeaf and WrapLoneEntryAsLeaf
emit any leaves before BuildIndex runs and call EnsureSize themselves. The
minLeafEntries parameter was unread; maxLeafEntries only inflated the
ValueScratch pre-size for nodes BuildIndex doesn't write. Also clear up two
stale comments referencing the removed OpenReader / LeafBoundaryEnumerator /
PrecomputeCommonPrefixLengths code paths.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…and stale members

Removed:
- IByteBufferWriter.Copy<TWriter,TReader,TPin>(...) 3-param reader-bulk-copy overload —
  every caller uses the 2-param Span overload.
- IHsstByteReader<TPin>.Bound { get; } and all 8 implementations (SpanByteReader,
  PooledByteBufferWriter.Reader, ArenaBufferReader, ArenaByteReader, WholeReadSessionReader,
  MmapByteReader test impl, HsstReaderTests CopyOnlyByteReader, HsstDenseByteIndexTests
  trailer/spec-stage readers). Property had no readers.
- KeyValueEntry.KeyLength property and constructor parameter — every consumer uses only
  ValueBound; the current key length is available via HsstEnumerator.CurrentKeyLength on
  callers that need it.
- HsstPackedArrayBuilder.WriteLeb128 — orphan private method.
- HsstTwoByteSlotValueReader.TryResolveAll and HsstTwoByteSlotValueLargeReader.TryResolveAll
  — only HsstDenseByteIndexReader.TryResolveAll is invoked across the codebase.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Every address-bound cache miss in TryGetAddressBound pays two reads
in HsstBTreeReader.TrySeek before stepping onto the root node: the
5-byte trailer (RootPrefixLen / RootSize / KeyLength) and the
variable-length root prefix. The address column is immutable for the
life of a snapshot, so those values are too. Resolve the address
column bound and read the trailer + root prefix once at construction,
then route the miss path through a new TrySeekFromRoot entry point
that takes the precomputed root descriptor and starts the walk
directly. Length == 0 on the cached bound is the sentinel for
"snapshot has no address column" and short-circuits the lookup to
"no entry".

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
PersistedSnapshotUtils.cs: remove `// 1. Accounts` / `// Deserialize X`
labels that just restated the loop variable. BloomFilterTests.cs: drop
AAA banners from a 4-line test. PersistenceManagerTests.cs: trim
`Bug A/B regression` prefixes and `OLD ... was removed` / `previously
skipped` framing — keep the current invariants each test asserts.
HsstBTreeBuilder{,Buffers}.cs: drop `(formerly HsstIndexBuilder)` and
`Previously stackalloc'd / Promoted to` framing; describe the current
pooled design directly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nch path

Bug: ArenaReservation.PersistOnShutdown() was a 1-line forwarder to
_arenaFile.PersistOnShutdown() — it set no per-reservation flag. So when
ArenaReservation.CleanUp ran on a snapshot inside Repo.Dispose, the
punch-hole reclaim path was gated only on MarkDead's structural return
value (sibling reservations still live in this file?). In a shared arena
where snap1 is disposed while snap2 still holds a reservation, MarkDead
returns true and TryPunchHole zeros snap1's range in the
preserve-flagged-but-still-live file. The file-level _preserveOnDispose
flag only suppressed File.Delete in ArenaFile.CleanUp; it had no effect
on the punch.

Fix: add _preserveOnDispose to ArenaReservation (same set-once
Interlocked.Exchange pattern as ArenaFile), set it in PersistOnShutdown
before forwarding to the file, gate TryPunchHole on it in CleanUp.
Reservation-level intent is now symmetric with file-level intent: both
layers honor the same opt-in.

Pruning path unaffected — pruned snapshots never call PersistOnShutdown,
so the flag stays 0 and TryPunchHole runs as before, reclaiming the
disk blocks of pruned reservations.

Test: parameterize Repository_Restart_PreservesAllData with two
maxArenaSize values. 4 KiB (each snapshot's metadata reservation
page-rounds to fill its arena file → MarkDead fully-dies on the sole
reservation → no punch path → masked the bug pre-fix). 1 MiB (both
snapshots' reservations pack into one arena → MarkDead returns true on
the first dispose → reproduces the bug pre-fix). Before this fix the
1 MiB variant failed on
  Assert.That(snap2.TryLoadStateNodeRlp(path2, ...), Is.True)
because snap2's metadata HSST was zeroed. After this fix both variants
pass.

Full Nethermind.State.Flat.Test → 870/870 + 7 pre-existing skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ource persist

Sibling to the existing AddToPersistence_InMemoryPersist_PrunesPersistedTier
for the persistedToPersist branch (line 426-432 of PersistenceManager).
A tier-source persist must also drive PruneBefore so the in-memory tier
doesn't keep growing with entries that RocksDB now supersedes.

Setup mirrors the existing DetermineSnapshotAction_FinalizedNoInMemory_
FallsBackToPersistedSnapshot test: empty arena reservation wrapped in a
PersistedSnapshot, stubbed TryLeaseSnapshotTo + LeaseBaseSnapshotsInRange
to feed DetermineSnapshotAction's tier fallback path.

Test passes — confirms the persistedToPersist branch correctly invokes
_repo.PruneBefore(persistedToPersist.To) after PersistPersistedSnapshot.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Centralises every boundary-math site in the persistence pipeline behind
ICompactionSchedule so the per-instance offset (used to stagger
multi-node deployments) reaches every layer that decides when to compact.

Pre-fix: only SnapshotCompactor honoured the offset. PersistenceManager
accepted ICompactionSchedule as a ctor dep but never called a method on
it. PersistedSnapshotCompactor didn't even take the dep. Both
re-implemented the alignment math inline (`b & -b`, `b % CompactSize`),
defeating the schedule's anti-lockstep design intent at the persisted
tier.

Interface gains three intent-revealing methods (no parameter overload —
one method per use case):

  bool IsFullCompactionBoundary(long blockNumber);
  long GetHierarchicalCompactSize(long blockNumber);
  bool IsHierarchicalBoundary(long blockNumber);

All three reuse a private ShiftedAlignment helper inside
CompactionSchedule that applies the offset once. Unlike GetCompactSize /
NextFullCompactionAfter, the three new methods do NOT short-circuit on
`_compactSize <= 1` — PersistedSnapshotCompactor runs with its own
min/max caps independent of config.CompactSize and the tests construct
that edge case.

Migrated call sites:
  PersistedSnapshotCompactor.cs:57 → GetHierarchicalCompactSize
  PersistedSnapshotCompactor.cs:72 → IsFullCompactionBoundary
  PersistenceManager.cs:120 → IsFullCompactionBoundary
  PersistenceManager.cs:129 → GetHierarchicalCompactSize
  PersistenceManager.cs:150 → IsHierarchicalBoundary
  PersistenceManager.cs:237/244 → NextFullCompactionAfter (existing method)

PersistedSnapshotCompactor's ctor gains an ICompactionSchedule param
(wired in FlatWorldStateModule + 4 test fixtures). PersistenceManager's
dead-code xmldoc on _schedule is removed. _compactSize field stays for
span-width checks (lines 238/321/379/601 measure snapshot extents, not
boundary decisions, and don't need the offset).

Tests: 3 parameterized tests added to CompactionScheduleTests covering
the new methods at offset 0 and 3. Full Nethermind.State.Flat.Test ->
892/892 pass + 7 pre-existing skips.

Sanity grep: after this commit, no production file under
Nethermind.State.Flat/ contains the inline `b & -b` or `% _compactSize`
expressions except a comment in CompactionSchedule.cs itself.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…anedFrontier

Sibling to the ArenaReservation punch-on-preserve bug (8ace70c).
PersistedSnapshot.CleanUp drops its blob lease then calls
TryResetOrphanedFrontier when the file has only the manager's lease
left — which is the case for every blob the last surviving snapshot
referenced during repo dispose. TryResetOrphanedFrontier punched the
WHOLE [0, prev) range without consulting the file's _preserveOnDispose
flag, so blobs the next session needs to rehydrate would survive on
disk (BlobArenaFile.CleanUp honours the flag for File.Delete) but their
contents would all read as zeros.

Symptom in a real workload: 32 blob frontier-resets logged as
punched=True followed by 32 BlobArenaFile.CleanUp with preserved=True
— all 32 files kept on disk, all 32 zeroed.

Fix: add BlobArenaFile.IsShutdownPreserved accessor (mirrors the
existing ArenaFile shape), gate TryResetOrphanedFrontier on it.
Pruning path unaffected — pruned snapshots never call PersistOnShutdown,
so the flag stays 0 and the orphan-frontier reclaim still punches the
blob to free disk blocks.

Regression test: extend Repository_Restart_PreservesAllData to write
10 valid-RLP-framed state nodes per snapshot (~500 bytes each →
cumulative blob frontier > 4 KiB so the punch over [0, frontier)
actually zeros the first OS page; sub-page punches are no-ops on tmpfs
and would have let the test silently pass with the bug present). Each
node round-trips through TryLoadStateNodeRlp post-reload. Pre-fix the
test fails with "Expected is Byte[503], actual is Byte[1]" — the
punched RLP prefix reads as zero, the reader interprets it as RLP
empty-string and returns a single-byte array.

Full Nethermind.State.Flat.Test -> 892/892 + 7 pre-existing skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replaces pre-extend-then-punch-hole with start-at-0-then-truncate so the
blob file's on-disk length always tracks the actual data extent.

Before: BlobArenaFile ctor pre-extended every new file to MaxSize via
sparse ftruncate, and TryResetOrphanedFrontier reclaimed via
fallocate(PUNCH_HOLE) + posix_fadvise(DONTNEED). Two side effects:
  1. FileInfo.Length == MaxSize always, diverging from Frontier from
     creation onward.
  2. Initialize's `frontier = FileInfo.Length` restore therefore reported
     MaxSize for any file that had been written to. The headroom check
     `len < _maxFileSize` was always false, so restored files never
     re-entered _mutableFiles — packing-reuse was strictly in-process.
The recently-fixed punch-on-preserve bug (8a78319) was made worse
because `prev = ReportedFrontier = MaxSize` meant the buggy punch wiped
the entire file's actual data, not just the originally-written prefix.

Now: BlobArenaFile ctor skips the pre-extension (FileStream.Write
auto-extends, and BlobArenaWriter's 1 MiB internal buffer amortises
growth syscalls anyway). TryResetOrphanedFrontier calls SetFileLength(0)
— one syscall that zeros the logical length AND frees all disk blocks
AND implicitly invalidates the page cache for the range. No fadvise
fallback needed.

The IsShutdownPreserved guard at the top of TryResetOrphanedFrontier
stays — truncating a preserve-flagged file is strictly worse than
punching it (file length 0 → NodeRef offsets point past EOF → reads
fail outright).

Side benefit: restored files now correctly report headroom via
`len < _maxFileSize`, so packing-reuse survives restarts.

Dead code removed: BlobArenaFile.PunchHole, BlobArenaManager's
_punchHoleOnReclaim field + _punchHoleSupported field +
PunchHoleSupported property + punchHoleOnReclaim ctor param +
PersistedSnapshotPunchHoleEnabledByTier[_tier] write. The shared config
flag PersistedSnapshotPunchHoleOnReclaim stays — still consumed by
ArenaManager (arena side untouched).

Tests:
  - ArenaReclaimPunchHoleTests.BlobFrontierReset_PunchesHole_*
    rewritten as BlobFrontierReset_TruncatesFile_ForOrphanedRange —
    single non-parameterized test that asserts FileInfo.Length == 0
    after TryResetOrphanedFrontier.
  - Repository_Restart_PreservesAllData gains an inter-session
    blob-length check: every preserved blob_*.bin must have
    0 < length <= MaxSize after Repo.Dispose. Catches a regression to
    truncating-on-preserve (length 0) OR pre-extending (length > MaxSize).
  - BlobArenaFile.FadviseDontNeed stays (still used by
    PersistedSnapshot.AdviseDontNeedBlobRange's read-side cache hint).

Full Nethermind.State.Flat.Test -> 891/891 + 7 pre-existing skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…nvert/compact

A persisted snapshot's bytes lived only in the OS page cache between the
writer's Complete and the next persistence-layer checkpoint. A crash in
that window left the catalog (committed via the RocksDB-backed
SnapshotCatalog) pointing at pages that were never durably on disk —
session 2 would Initialize and TryLeaseSnapshotTo successfully but reads
would return whatever the kernel had not yet flushed.

Fix: fsync the metadata arena (always) and the blob arena (base
snapshots only) between writer.Complete and _catalog.Add. Catalog never
records an entry that references unsynced pages.

Convert path (PersistedSnapshotRepository.ConvertSnapshotToPersistedSnapshot):
both reservation.Fsync() + blobWriter.Fsync(). Base snapshots write
trie-node RLPs to a blob arena, so both files need the barrier.

Compact path (PersistedSnapshotCompactor.CompactRange): reservation.Fsync()
only. Compaction emits NodeRefs into the existing base blob arenas; it
writes no new blob bytes, and those base blobs were fsynced when their
respective base snapshots were converted.

Plumbing:
  - PosixReclaim.Fsync(int fd) — libc fsync(2), no-op on non-Linux for
    cross-platform test runs, throws IOException on errno.
  - ArenaFile.Fsync() / BlobArenaFile.Fsync() — thin wrappers.
  - ArenaReservation.Fsync() — forwards to the underlying arena file
    (the post-Complete handle the convert/compact caller already holds).
  - BlobArenaWriter.Fsync() — forwards to its file; asserts the writer
    has been Complete'd so the managed buffer is flushed first.

Full Nethermind.State.Flat.Test -> 891/891 + 7 pre-existing skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t durability barrier

fdatasync skips the mtime/ctime flush that fsync would do but still
flushes the file size — which is the only metadata change the auto-grown
blob file produces and the only metadata our reads depend on (catalog
RocksDB has its own durability barrier; we only need the arena/blob
file's data + size to be on disk before the catalog records the entry).

On ext4 the journal entry is smaller and the commit can occasionally be
skipped when only mtime would have been dirty, giving ~10-30% savings
per durability barrier. Drop-in: PosixReclaim.Fsync still surfaces as
.Fsync() everywhere (ArenaFile, BlobArenaFile, ArenaReservation,
BlobArenaWriter) — just the underlying syscall changed.

Full Nethermind.State.Flat.Test -> 891/891 + 7 pre-existing skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…arker

The auto-grow design (e5a3594) made FileInfo.Length the frontier
source, which is unreliable under torn writes: a crash mid-FlushBuffer
leaves bytes past the writer's last-committed frontier; restart treats
those uncommitted bytes as valid data and readers parse trailing partial
RLPs as garbage (or surface as RlpException). The earlier
"restart-seals-each-file" symptom that originally motivated dropping
pre-extension was actually caused by FileInfo.Length-based frontier
restore being conflated with on-disk file size — separable concerns.

Fix:
  1. Restore pre-extension to MaxSize at file creation (sparse ftruncate).
     The blob writer never has to grow the file inline; disk blocks
     remain lazily allocated.
  2. Reserve the first 4 bytes of every blob file as an on-disk frontier
     marker (int32 LE = absolute next-write offset). BlobArenaWriter.Complete
     publishes the new frontier into the marker; BlobArenaManager.Initialize
     reads it. Frontier recovery is decoupled from FileInfo.Length.
  3. Reset becomes punch-hole + marker reset (NOT truncate). Punching
     [HeaderSize, prev_frontier) reclaims disk blocks while the file's
     logical length stays at MaxSize; the marker resets to HeaderSize
     so restart sees "no data". Marker reset is sequenced BEFORE the
     punch — a crash between the two leaves a consistent fresh-marker /
     stale-data state instead of a stale-marker / zeroed-data state.
  4. Headroom check in Initialize uses Frontier (marker-derived), not
     FileInfo.Length (always MaxSize after pre-extension) — restored
     files now correctly re-enter the packing pool.

NodeRef.RlpDataOffset semantics unchanged — still a file-absolute byte
offset. Fresh-file offsets begin at HeaderSize=4 instead of 0. Readers
use whatever offset the writer returned; no read-path change. BlobRange
shifts automatically through blobWriter.StartOffset/Written.

The IsShutdownPreserved guard at the top of TryResetOrphanedFrontier
stays unchanged — without it, a preserve-flagged file's data range
would still be punched on the last lease drop. Marker reset would also
fire, leaving the file with a fresh marker pointing at zeroed data,
which is strictly worse than the original sparse-hole symptom.

Tests:
  - BlobFrontierReset_TruncatesFile_ForOrphanedRange rewritten as
    BlobFrontierReset_PunchesHoleAndResetsMarker_ForOrphanedRange:
    file length unchanged at MaxSize, marker reset to HeaderSize,
    disk blocks freed (StatBlocks).
  - LongFinalityIntegrationTests.Repository_Restart_PreservesAllData
    tightens blob-length assertion to == MaxSize.
  - New regression: Repository_Restart_IgnoresTornWritePastFrontierMarker
    writes a snapshot, fsyncs (marker durable), then manually appends
    4 KiB of garbage past the marker, reloads, asserts the snapshot
    round-trips intact and Frontier matches the marker (not file length).
    Verified pre-fix this test fails (Frontier becomes len, garbage is
    interpreted as committed data).

Full Nethermind.State.Flat.Test -> 892/892 + 7 pre-existing skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Resolved using-directives conflict in PersistenceManagerTests.cs (kept
  our System.Threading.Tasks for async TearDown; dropped the unused
  FluentAssertions import).
- Migrated 7 branch-side test files from FluentAssertions to NUnit
  Assert.That style, matching master's branch-wide "Unify assertions"
  (ba5a917) cleanup which removed the FluentAssertions package:
    ArenaReclaimPunchHoleTests.cs
    ArenaMetricsTests.cs
    ArenaManagerEvictionQueueTests.cs
    ArenaBufferWriterReaderTests.cs
    PageResidencyTrackerTests.cs
    ArenaManagerForgetOnAdviseTests.cs
    Persistence/BloomFilter/BloomFilterTests.cs

  Mapping: .Should().Be(x) -> Assert.That(., Is.EqualTo(x)),
  .Should().Be{True,False,Null,Empty}() -> Assert.That(., Is.{True,False,Null,Empty}),
  .Should().Be{Greater,Less}Than(x) -> Assert.That(., Is.{GreaterThan,LessThan}(x)),
  .Should().HaveCount(n) -> Assert.That(., Has.Count.EqualTo(n)),
  .Should().Equal(a, b, c) -> Assert.That(., Is.EqualTo(new[] { a, b, c })),
  .Should().Contain(x) -> Assert.That(., Does.Contain(x)),
  .Should().Throw<E>() -> Assert.That(., Throws.TypeOf<E>()),
  .Should().ContainSingle().Which.Should().Be(x) ->
    Assert.That(., Is.EqualTo(new[] { x })).

Full Nethermind.State.Flat.Test -> 892/892 + 7 pre-existing skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ersisted

Reorders the four child edges explored from each BFS node:

  before:  0 in-mem compacted, 1 persisted compacted, 2 persisted base, 3 in-mem base
  after:   0 in-mem compacted, 1 in-mem base,         2 persisted compacted, 3 persisted base

The previous order had the in-mem base last, breaking the "no-disk-read wins
among ties in width" principle that the compacted-tier ordering already
followed. Combined with the gating at line 90
  (`if (currentPersisted && edgeIsInMemory) continue;`)
the old order caused the BFS to commit to the persisted tier the first time
both an in-mem and a persisted base existed for the same `to` — typically
during a convert window — barring any wider in-mem compacted skip-pointer
that might have existed downstream and produced a shorter overall bundle.

New rule (now consistent): prefer in-RAM over disk; among ties prefer wider.
This also keeps the BFS option-set open for the remainder of the traversal
in the rare both-tiers-overlap case.

No new tests — the existing 892-test suite passes unchanged, confirming no
test relied on the old persisted-base-first order at a tier-overlap point.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ule offset

The alignment value comes from _schedule.GetHierarchicalCompactSize which
applies the per-instance offset (ShiftedAlignment: `(b + _offset) & -(b + _offset)`).
The compaction trigger fires when this offset-shifted alignment hits the
desired tier — i.e. the window the trigger selects is
(snapshotTo - alignment, snapshotTo] in raw block-number space.

The previous formula `((blockNumber - 1) / alignment) * alignment` is
"largest multiple of alignment ≤ blockNumber - 1", which equals
`blockNumber - alignment` ONLY when blockNumber mod alignment == 0 — i.e.
when offset == 0. With a non-zero offset, blockNumber mod alignment is
some non-zero value (call it k), so the formula gives:
  startingBlockNumber = blockNumber - k
producing an output span of k blocks instead of the alignment-sized window
the trigger actually selected.

Observed in production logs: requestedCompactSize=128 at to=23448506 with
a non-zero offset produced startingBlock=23448448 (span 58 instead of 128).
Same shape for P10, P26 — all (block mod alignment) under the configured
offset. The compacted output's actual block range is wrong, so the
hierarchical compaction tier accumulates short windows that don't
correspond to any real CompactSize-aligned coverage.

Fix: replace with the correct window left edge.
  startingBlockNumber = blockNumber - alignment

DoCompactPersistable's `blockNumber - _compactSize` was already correct —
it never used the buggy formula — so this only affects the sub-CompactSize
hierarchical-merge path.

Full Nethermind.State.Flat.Test -> 892/892 + 7 pre-existing skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…fset bug

Pins the fix in 718b2dc: with a non-zero schedule offset, the previous
((blockNumber - 1) / alignment) * alignment formula produced a window
shorter than the alignment that triggered it. The new
(blockNumber - alignment) gives the correct (snapshotTo - alignment, snapshotTo]
window the offset-shifted trigger actually selects.

Geometry: offset=3, CompactSize=64, maxCompactSize=32. At block 45,
(45 + 3) & -(45 + 3) = 16 fires alignment=16. Window must be (29, 45]
(span 16, From.BlockNumber=29). Pre-fix produces (32, 45] (span 13,
From.BlockNumber=32) — verified by temporarily reverting the formula
and confirming the new test fails with "Expected: 29, But was: 32"
before re-applying the fix.

Full Nethermind.State.Flat.Test -> 893/893 + 7 pre-existing skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…ontier marker"

This reverts commit 9829572.

Restores the auto-expanding blob design (e5a3594):
- BlobArenaFile starts at length 0; FileStream.Write auto-extends.
- No 4-byte on-disk frontier marker.
- BlobArenaManager.Initialize restores frontier from FileInfo.Length.
- TryResetOrphanedFrontier truncates to 0 (SetFileLength(0)) — no
  punch-hole, no PunchHole method on BlobArenaFile, no
  PunchHoleSupported on BlobArenaManager.
- The IsShutdownPreserved guard at the top of TryResetOrphanedFrontier
  stays (added in 8a78319, predates the marker commit).
- The torn-write regression test (Repository_Restart_IgnoresTornWrite-
  PastFrontierMarker) added by the marker commit is also removed.

Conflict resolution: ArenaReclaimPunchHoleTests.cs's BlobFrontierReset
test body merged using the parent's assertion logic but with the master
merge's NUnit Assert.That style (FluentAssertions is gone branch-wide
since ba5a917).

Full Nethermind.State.Flat.Test -> 892/892 + 7 pre-existing skips.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
GatherReadOnlySnapshotBundle leased one PersistedSnapshotBloom per
persisted snapshot keyed only by To. A registration race with compaction
can leave a narrower bloom (e.g. a base bloom) at a wider compacted
snapshot's To slot. The bundle uses the bloom as a negative read filter,
so an under-covering bloom returns false for keys touched in the
uncovered block range, silently skipping the snapshot and yielding
stale/missing state.

Add a LeaseOrSentinel(from, to) overload that only leases when the
entry's From covers the requested range, falling back to AlwaysTrue
otherwise, and call it from the bundle method. Mirrors the existing
silent sentinel-on-miss behavior.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nstructBloom

LoadFromCatalog used to build a real bloom for every loaded snapshot inline —
one HSST scan per catalog entry on every process start, even though most of
those per-base blooms are immediately superseded by a wider compacted/persistable
bloom that exists in the same catalog.

Drop the inline build. After all snapshots are routed into their buckets, do
one pass that walks the union of every bucket's To set newest→oldest, picks the
widest snapshot at each uncovered slot (range-compare across all three buckets,
nullable base for slots where only a compacted/persistable survived), builds
its bloom, and registers it. Register's range walk then fills every covered
slot in one shot; subsequent (lower-block) iterations skip via ContainsSlot.

PersistedSnapshotBloomFilterManager.Register gains an optional parentLookup
delegate so a compacted-on-unpopulated-slot registration (only reachable from
ReconstructBloom — runtime callers don't pass it) can synthesize the chain
entry by stepping back through the predecessor base StateId graph, instead of
breaking the way the original "compacted on unpopulated key: stop" guard does
for the runtime compaction path.

Net effect: bloom build count on startup drops from "one per snapshot" to "one
per maximal-covering snapshot in the graph"; manager end-state matches the
runtime end-state after a long-running session's compactions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s per state

The SnapshotCatalog key was 40 bytes (To.BlockNumber + To.StateRoot), with the
SnapshotKind and From only in the value. When base + persistable share a To
(the persistable assembled from a base's window ends at the same block as the
last base), the second Add overwrites the first in the underlying IDb; on
restart only the last-written kind survives at that To. The bloom test added
in the previous commit surfaced this by observing the lost base after reload.

Extend the catalog key with the snapshot's depth (To.BlockNumber -
From.BlockNumber) as 8 bytes big-endian — preserving the existing all-ascending
lex order. The (To.BlockNumber, To.StateRoot, depth) triple is globally unique
by construction: base has depth=1, persistable has depth=CompactSize,
sub-CompactSize compacted has depth ∈ {2, 4, …, CompactSize/2} (depth=1 is
excluded since a merge has ≥2 sources, depth=CompactSize is explicitly skipped
in DoCompact), and hierarchical compacted has depth ∈ {2·CompactSize, …}.

Catalog version bumps v6 → v7 — old directories trigger the existing
wipe-and-resync error in Load(). Metadata key length 4 stays distinct from the
new 48-byte entry key, so length-based discrimination in Load() is unchanged.

Add(entry) derives depth internally so production call sites in
ConvertSnapshotToPersistedSnapshot and AddCompactedSnapshot need no edit. The
read/remove/update side (Remove, Find, UpdateLocation) gains a depth parameter
because those callers don't carry a full CatalogEntry; PruneBucketBeforeLocked
captures depth from the still-alive snapshot before Dispose, then passes it to
RemoveFromCatalog so the catalog removal scopes to this bucket's entry alone.

Tests:
- SnapshotCatalog_SaveLoad_RoundTrips extended to write three entries at the
  same To with depths 1/2/4 (base / compacted / persistable) plus a tail
  entry; all four round-trip distinct via Find(to, depth)
- SnapshotCatalog_Remove_And_Find / UpdateLocation updated to the new
  signatures; adds a case "remove one depth at a To leaves the sibling intact"
- LoadFromCatalog_ReconstructsBloom_FromWidestCoveringSnapshot strengthened:
  base at the persistable's To now lease-succeeds after reload alongside the
  persistable
- New LoadFromCatalog_RoundTripsBaseAndPersistableAtSameTo: 4 bases + 1
  persistable in session 1, SnapshotCount==5 in session 2 (was 4 pre-fix)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…gress logging

For a multi-thousand-entry catalog, startup was serial and silent:
LoadFromCatalog opened an arena reservation + leased blob arenas + constructed
a PersistedSnapshot per entry, then ReconstructBloom built (and registered) one
bloom per picked snapshot. All on the loading thread, all under _catalogLock,
no progress reported.

Both phases have plenty of independent work and the underlying primitives are
already concurrency-safe (ConcurrentDictionary buckets, lock-free
ArenaManager.Open / BlobArenaManager.TryLeaseFile, CAS-based
PersistedSnapshotBloomFilterManager.Register with TryAdd / TryUpdate retry).
The only non-concurrent state is the three SortedSet<StateId> ordered ids plus
_lastRegisteredState.

Refactor:

LoadFromCatalog
- LoadSnapshot strips the SortedSet registration — it now only mutates the
  ConcurrentDictionary bucket and Interlocked metric counters and is safe under
  Parallel.ForEach.
- New private LoadSnapshotsParallel runs the Parallel.ForEach with an optional
  ProgressLogger + System.Timers.Timer heartbeat (every 1s), gated on
  ParallelLoadThreshold = 1024 entries (and _logger.IsInfo).
- After the parallel section, a serial post-pass walks the catalog-sorted
  entries and calls RegisterStateIdLocked into the right SortedSet — the
  ordering puts _lastRegisteredState on the highest registered StateId
  naturally, no ComputeLastRegisteredLocked() needed.

ReconstructBloom
- Phase A (serial, cheap): walk the union of every bucket's To ids
  newest→oldest, simulating Register's chain walk locally via a
  HashSet<StateId> coveredSlots set (SimulateRegisterFill mirrors Register's
  cur.BlockNumber > fromBlock exit and parentLookup-stepping so the simulated
  fill matches Register's actual fill, including the early termination when
  parentLookup returns default past a pruned gap). Collects picks in
  newest→oldest order.
- Phase B: picks.Reverse() — the older end of allTos is where the wider
  persistables and hierarchical merges accumulate; putting them first in the
  parallel queue lets Parallel.ForEach's partitioner give the long-running
  bloom-builds to threads first (LPT scheduling) so a single big bloom doesn't
  dominate the tail.
- Phase C (parallel): per pick, build the bloom via WholeReadSession +
  PersistedSnapshotBloomBuilder.Build, then Register. Picks have disjoint slot
  ranges by construction (Phase A's coveredSlots), so parallel Register calls
  touch different _blooms keys. Same ParallelLoadThreshold gates the progress
  logger + heartbeat.

Constructor change
PersistedSnapshotRepository ctor accepts an ILogManager (mirroring
PersistedSnapshotCompactor's pattern). FlatWorldStateModule resolves it from
the container; all test and benchmark call sites pass LimboLogs.Instance.

Tests
- New LoadFromCatalog_Parallel_PreservesOrderingAndDicts: 32 bases + 2
  persistables (CompactSize=8) in session 1, reload in session 2 — asserts
  SnapshotCount, per-bucket TryLease*To for every entry, the From-chain walk
  via TryGetSnapshotFrom, and bloom-slot identity across a persistable's
  range. Stays below the threshold so the gate stays a one-line by-inspection
  branch.
- All existing PersistedSnapshotRepository(...) call sites threaded through
  LimboLogs.Instance.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
asdacap and others added 5 commits June 19, 2026 16:08
Revert the demote-to-AlwaysTrue path: a demoted sub-CompactSize intermediate
again keeps its real merged bloom (Demote only advises its pages cold). That
optimization was too narrow — it freed only the sub-CompactSize intermediates'
blooms and at the cost of all pre-filtering (the AlwaysTrue sentinel).

Instead, when a >CompactSize merge produces a PersistedLargeCompacted snapshot
over the canonical range (from, to], its merged bloom is a correct superset
pre-filter for every persisted snapshot fully contained there. Walk that range
block-by-block by parent across all four buckets and adopt the single big bloom
into each contained snapshot, freeing its own (multi-MiB) bloom while still
pre-filtering. Bounded by persistence lag, off the hot path.

Sharing one bloom safely requires:
- A separate RefCountedBloomFilter wrapper (SmallRefCountingDisposable) owns the
  lease count and the bloom-memory metric, leaving BloomFilter a pure data
  structure. The wrapped filter frees only once the big snapshot and every twin
  (and their in-flight readers) release it — a reader holding a snapshot lease
  keeps the wrapper, hence the filter, alive regardless of prune order. A shared
  filter is counted in the metric once.
- Replacement goes through re-registration (a twin over the same reservation),
  not in-place mutation: MightContain is a lock-free check-then-use on the data
  pointer, so swapping+freeing a live snapshot's bloom is a use-after-free. The
  old instance keeps its own bloom until its refcount drains.

The per-snapshot bloom is therefore immutable — always supplied to the ctor.
The reload path no longer mutates it (SetBloom is gone); ReconstructBloom builds
each snapshot's real bloom and re-registers an equivalent snapshot over the same
reservation carrying it, replacing the AlwaysTrue placeholder.

Snapshots extending below `from` are skipped (the big bloom is not a superset of
their keys — sharing would cause false negatives). Pure live-memory optimization:
blooms are not persisted, so reload rebuilds independent blooms.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ReconstructBloom built an independent bloom for every loaded snapshot, so a
restart re-inflated bloom memory (and startup CPU) versus the running state —
where, after a large compaction, contained snapshots share the wide skip-pointer's
bloom and free their own. This restores the shared-bloom-on-load behavior removed
by 0fd5067, now driven by the runtime ShareBloomAcrossRange.

Reload now assembles the widest-first persisted chain with the main read-path
AssembleSnapshots (from the head down to the committed base = the oldest loaded
snapshot's From), then rebuilds a bloom only for each (widest) snapshot in that
chain — in parallel — and shares each across its range. The contained narrower
snapshots adopt it instead of carrying their own; snapshots no widest one covers
keep the AlwaysTrue placeholder (correct, just unfiltered). The chain ranges are
disjoint, so the parallel shares don't collide. Confined to the loader — no new
repository/interface surface.

ShareBloomAcrossRange blooms the widest (starting) snapshot itself: at `to` it
re-registers the entry in every tier, and the placeholder bloom isn't the fresh
one so it isn't skipped (unlike the runtime path, where the big snapshot already
carries the bloom and is skipped via ReferenceEquals).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A large compaction adopts one bloom across the snapshots it contains, so a
later compaction can assemble several sources pointing at the same filter,
each reporting the whole window's key count. Summing per source inflated
bloomCapacity (and the allocated merged filter) by the number of sharers.
Dedup by bloom owner so a shared filter is counted exactly once.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add PersistedSnapshotBloomCount, incremented/decremented in the RefCountedBloomFilter ctor/CleanUp alongside the existing bloom-memory accounting. Keyed to the wrapper, a bloom shared across snapshots counts once, so reading it against the active-snapshot count measures large-compaction bloom-sharing effectiveness. Marked [DetailedMetric].

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ted table

Split the HSST on-disk format out of flat/long-finality: persisted-snapshot
metadata is now one deliberately-unoptimized, single-level binary-search
SortedTable instead of the nested columnar HSST. Trie-node RLP still lives in
blob arenas as NodeRefs.

- Add Sorted/{SortedTable,SortedTableBuilder,SortedTableReader,
  SortedTableEnumerator} + FORMAT.md. Interleaved [ks][key][vs][value] records,
  a u32 offset index, and a footer; the builder buffers all keys and sorts the
  offsets at Build, the reader binary searches.
- PersistedSnapshotKey materializes verbose <column><addr><subcolumn><slot> keys
  with the column/subcolumn tag bytes stored as 255-tag, so a plain ascending
  sort reproduces the HSST reverse-tag emission order. The builder/compacter
  ordering and comparers are unchanged for a future proper-HSST swap.
- Rewrite builder/reader/merger/scanner and PersistedSnapshot over the sorted
  table; drop AddressBoundCache and the BTree-root caching. The merger keeps
  newest-wins plus per-address self-destruct truncation.
- Remove all HSST format code (BTree/PackedArray/DenseByteIndex/TwoByteSlot,
  dispatchers, merge plumbing); keep the seam interfaces, the Storage arena
  layer, bloom, and orchestration.
- Bump SnapshotCatalog version 1->2 (old HSST dirs require wipe-and-resync).
- Delete the HSST format unit tests; add SortedTableTests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@claude

claude Bot commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Claude finished @asdacap's task in 10m 4s —— View job


PR Review — refactor(flat): replace persisted-snapshot HSST with single-level sorted table

  • Gather context and understand the PR
  • Review core sorted table format implementation
  • Review persisted snapshot builder/reader/merger/scanner
  • Review storage layer (arena, blob arena, catalog)
  • Review key encoding and ordering correctness
  • Review resource management and dispose patterns
  • Review tests
  • Post inline comments and final summary

Summary

The PR cleanly replaces the nested HSST columnar format with a simpler single-level sorted table backed by a binary-search reader, and the overall design is sound. The new SortedTable wire format is well-specified and the FORMAT.md is a good addition. The arena/blob-arena storage layer is sophisticated but well-structured. However, four High issues need to be resolved before merge:


Critical: 0 | High: 4 | Medium: 5 | Low: 4


High

H1 — SmallRefCountingDisposable: resurrection race window (line 41)

Between the CAS that drives _leases from 1 → 0 and the second CAS that drives 0 → Disposing, a concurrent TryAcquireLease sees _leases == 0 (NoAccessors), passes the == Disposing guard, and successfully CASes 0 → 1. The caller holds a live lease on an object that has logically hit zero count. Fix: also treat NoAccessors as a rejection in TryAcquireLease:

if (current is Disposing or NoAccessors) return false;

H2 — PersistedSnapshotMerger: TryRead return ignored; 0x00 default treated as Destructed (line 246)

SelfDestructDestructedMarkerByte = 0x00 and flag is initialised to 0. If TryRead returns false (corrupt value bound, seek failure), flag stays 0 = DestructedMarkerByte, so barrier = i is set. A re-created contract's NewMarker (0x01) entry failing to read silently triggers the self-destruct barrier, dropping that address's slots in the compacted snapshot.

H3 — PersistedSnapshotMerger: merged self-destruct tag is wrong for re-created contracts (line 250)

When a compaction range spans a destruction (source A, Destructed) followed by a re-creation (source B > A, New), the loop correctly sets barrier = A and keeps source-B's slots, but the emitted tag is Destructed (because barrier >= 0) rather than New. Stack consumers then see "contract destroyed" and refuse to serve the re-created contract's state. The emitted tag should be taken from the newest matching source, not from any historical destruction.

H4 — PersistedSnapshotBuilder: TryGet* return ignored, null-forgiving dereference (line 274)

snapshot.TryGetStateNode(path, out TrieNode? node) and snapshot.TryGetStorageNode(...) return values are discarded, then node!.FullRlp is dereferenced. Keys are pre-extracted with IsPersisted = true, but a concurrent prune or eviction between extraction and the write pass can null the node, producing an unguarded NullReferenceException that silently aborts the snapshot build.


Medium

M1 — ArrayPool rental not in finally (line 261)
rlpBuffer is rented at line 202 and returned at line 261. Any exception inside the per-address loop leaks the rental. Wrap in try/finally.

M2 — SortedTable.TryReadFooter: signed overflow in n * OffsetSize (line 62)
For a crafted/corrupt footer with n near long.MaxValue / 4, n * OffsetSize silently overflows to a negative value; the > table.Length guard then passes, corrupting offsetRegionStart. Pre-validate n <= (table.Length - FooterSize) / OffsetSize before multiplying.

M3 — MergeEntries: stackalloc int[n] with runtime-variable n (line 105)
Unbounded stackalloc risks StackOverflowException for large compaction batches. Add an upper-bound guard or use a fixed stack size with heap fallback.

M4 — Leb128: signed/unsigned mismatch (file)
Method signatures accept long but encoding is unsigned LEB128. Write(-1) encodes 10 bytes; Read returns long.MaxValue. Either change to ulong or document the non-negative constraint explicitly.

M5 — PersistedSnapshotCompactor.EnsureStarted not thread-safe
The _compactPersistedTask ??= ... and _boundaryCompactorTasks is null checks are unguarded. Two concurrent callers could spawn duplicate worker sets and race to consume the bounded channel. Should be protected by the existing _lock or Lazy<T>.


Low

L1 — ArenaReservation.CleanUp: wasted FadviseDontNeed when file is already dead (line 192)
FadviseDontNeed fires even when fileSurvives == false (file is about to be deleted). Gate it on fileSurvives.

L2 — SortedTableBuilder.Build: raw pointer captured before Sort (line 70)
keyBase is snapshotted from _keyBuf before _entries.Sort. Safe today because _keyBuf is never written during the sort, but there is no comment asserting this invariant. Add one to prevent a subtle use-after-realloc in future.

L3 — SortedTableEnumerator: footer failure is silent
When TryReadFooter returns false, the enumerator behaves as empty with no way for callers to distinguish "empty table" from "corrupt table". An IsValid property or a thrown exception on first MoveNext would make misuse visible.

L4 — SortedTableEnumerator: key buffer grows to exact fit
if (keyLength > _keyBuf.Length) _keyBuf = new byte[keyLength] — allocates on every growth. Use Math.Max(_keyBuf.Length * 2, keyLength) to amortize.


Design note

The 255 − tag encoding used for column/subcolumn bytes in PersistedSnapshotKey is correct and produces the desired sort order. The inline comment at SortedTable that explains the sort → HSST compatibility story is clear and helpful for future reviewers.

{
// Volatile read for starting value
long current = Volatile.Read(ref _leases);
if (current == Disposing)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High — TryAcquireLease can succeed on an object mid-disposal

Between the CAS that atomically moves _leases from Single (1) → 0 and the second CAS that moves 0 → Disposing (-1), a concurrent TryAcquireLease sees _leases == 0, which is NoAccessors — not Disposing — so the guard at line 41 passes and the CAS on line 49 (0 → 1) succeeds. The caller gets a live lease on an object that should be dead.

The object is cleaned up eventually (when the re-acquirer releases), but in between it passes TryAcquireLease-based guards that callers rely on to detect a doomed file/arena.

Fix: also reject NoAccessors in the initial check:

Suggested change
if (current == Disposing)
if (current is Disposing or NoAccessors)

This makes TryAcquireLease return false as soon as the count hits zero, closing the resurrection window. The inner-loop retry already handles the Disposing case, so this only adds the 0 guard to the fast-path.

int i = matching[k];
byte flag = 0;
TReader r = views[i].CreateReader();
r.TryRead(enums[i].CurrentValue.Offset, new Span<byte>(ref flag));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High — TryRead return value ignored; failure silently masquerades as Destructed

SelfDestructDestructedMarkerByte = 0x00 and flag is initialised to 0. If TryRead returns false (e.g. a corrupted value bound), flag remains 0, which equals DestructedMarkerByte, so barrier = i is set. A corrupt/missing self-destruct entry for a re-created contract (which would correctly hold NewMarker = 0x01) is silently treated as a destruction, causing the merger to drop post-creation slots.

if (!r.TryRead(enums[i].CurrentValue.Offset, new Span<byte>(ref flag))) continue; // skip unreadable entries
if (flag == PersistedSnapshotTags.SelfDestructDestructedMarkerByte) barrier = i;

if (flag == PersistedSnapshotTags.SelfDestructDestructedMarkerByte) barrier = i; // newest destructed
}

table.Add(key, barrier >= 0

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High — Merged self-destruct tag is wrong when a newer source re-creates the contract

The loop sets barrier to the newest source that wrote DestructedMarker, but the emitted tag uses barrier >= 0 to decide whether to write Destructed. If sources cover blocks where a contract was destroyed (source A) and then re-created (source B, index > A), the loop leaves barrier = A (the New byte 0x01 != 0x00 does not update barrier), and the emitted tag is Destructed even though the newest version is New.

Consumers of the compacted snapshot (e.g. PersistedSnapshotStack.TryGetSelfDestruct) will report the contract as destroyed and refuse to serve its storage slots, causing incorrect "contract doesn't exist" responses for a live re-created contract.

Fix: the emitted tag should be taken from the newest matching source rather than from the presence of any historical destruction:

// newest source wins for the written tag; barrier still controls slot filtering
byte newestFlag = 0;
r = views[matching[^1]].CreateReader();
r.TryRead(enums[matching[^1]].CurrentValue.Offset, new Span<byte>(ref newestFlag));
table.Add(key, newestFlag == PersistedSnapshotTags.SelfDestructDestructedMarkerByte
    ? PersistedSnapshotTags.SelfDestructDestructedMarker
    : PersistedSnapshotTags.SelfDestructNewMarker);

{
TreePath path = keys[i];
snapshot.TryGetStateNode(path, out TrieNode? node);
NodeRef nr = blobWriter.WriteRlp(node!.FullRlp.AsSpan());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

High — TryGetStateNode return value ignored; node! dereference will NRE if node was evicted

Keys are extracted from snapshot.StateNodes in the parallel lambda (line 87–95), where IsPersisted = true is set to signal that these nodes should not be evicted before the write pass. However if the IsPersisted flag race or a concurrent prune between the extraction and this write pass removes the node, TryGetStateNode returns false and node is null. The ! null-forgiving operator then throws NullReferenceException, crashing the persisted-snapshot build silently. Same pattern at line 300–301 for WriteStorageNodes.

At minimum, guard and throw a descriptive exception:

if (!snapshot.TryGetStateNode(path, out TrieNode? node) || node is null)
    throw new InvalidOperationException($"State node at {path} was evicted between extraction and persist.");

}
}

ArrayPool<byte>.Shared.Return(rlpBuffer);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — ArrayPool rental not in a finally block; leaks on any exception in the per-address loop

rlpBuffer is rented at line 202 and returned here. Any exception thrown inside the for loop (e.g. from table.Add, Rlp.Encode, or bloom operations) skips this Return, leaking the rental until the next GC:

byte[] rlpBuffer = ArrayPool<byte>.Shared.Rent(256);
try
{
    // ... for loop ...
}
finally
{
    ArrayPool<byte>.Shared.Return(rlpBuffer);
}

long n = BinaryPrimitives.ReadInt64LittleEndian(footer);
if (n < 0) return false;

long offsetRegionLength = n * OffsetSize;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Signed overflow in n * OffsetSize before the bounds guard for adversarial inputs

n is read directly from the on-disk footer as an i64. If a crafted or corrupted file provides n near long.MaxValue / 4, the multiplication n * OffsetSize silently overflows to a negative value. The guard on line 63 would then pass (negative + FooterSize < table.Length), and offsetRegionStart would be computed from the wrong (negative) offset length. Pre-validate n before multiplying:

// Guard before multiplying to prevent signed overflow on adversarial input
if (n > (table.Length - FooterSize) / OffsetSize) return false;
long offsetRegionLength = n * OffsetSize;

int barrier = -1;

Span<byte> minKey = stackalloc byte[PersistedSnapshotKey.MaxKeyLength];
Span<int> matching = stackalloc int[n];

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — stackalloc int[n] with runtime-variable, unbounded n

n = views.Length is the number of snapshot inputs. In practice this is small (typically 2–16), but there is no upper-bound guard. A large compaction batch (or a fuzzer passing many inputs) would cause StackOverflowException. A safe guard or heap fallback for large N:

Span<int> matching = n <= 64
    ? stackalloc int[64]
    : new int[n];
matching = matching[..n];

{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static long Read(ReadOnlySpan<byte> data, ref int offset)
{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium — Signed/unsigned mismatch: method signatures say long but encoding is unsigned LEB128

Write casts value to ulong before encoding, so negative long values are encoded as their two's-complement bit pattern (10 bytes of 0xFF/0x01). Read performs unsigned decode with no sign-extension, so Read(Write(-1)) returns long.MaxValue (not -1). Any caller passing a negative long will silently get a corrupt round-trip.

Either change both signatures to ulong, or add signed LEB128 support (sign-extend the final group's high bit). At a minimum, document the constraint in the XML summary:

/// <summary>Reads an unsigned LEB128-encoded value. Only non-negative <see cref="long"/> values
/// are supported; the encoding is the two's-complement unsigned representation.</summary>

// skipped.
bool preserve = Volatile.Read(ref _preserveOnDispose) == 1;
bool punched = !preserve && fileSurvives && _arenaManager.TryPunchHole(_arenaFile, Offset, footprint);
if (!punched)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low — Wasted FadviseDontNeed syscall when the arena file is already dead

When fileSurvives == false, MarkDead has released the manager's lease on ArenaFile. The ArenaReservation still holds its own lease (so the handle remains open), but the file is about to be File.Deleted on the last-lease release at line 197. The fadvise call here is therefore wasted — the kernel will drop those pages when the file is unlinked anyway. Guard it on fileSurvives:

Suggested change
if (!punched)
if (!punched && fileSurvives)
_arenaFile.FadviseDontNeed(Offset, footprint);

Span<Entry> entries = _entries.AsSpan();
if (entries.Length > 0)
{
byte* keyBase = (byte*)Unsafe.AsPointer(ref MemoryMarshal.GetReference(_keyBuf.AsSpan()));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low — Raw pointer captured before Sort; safe today but latent footgun

keyBase is a byte* into _keyBuf's backing store, snapshotted before _entries.Sort(...). NativeMemoryList<byte> never grows during Sort (sorting is done on _entries, not _keyBuf), so the pointer is stable. However, this two-step pattern — capture pointer, then mutate the sibling list — is fragile: a future maintainer could trigger a _keyBuf.Add inside the comparator (e.g. for diagnostics) and create a dangling pointer. A comment asserting the invariant would prevent this:

// _keyBuf is not touched during sort — only _entries is reordered.
// keyBase remains valid for the lifetime of the sort.
byte* keyBase = (byte*)Unsafe.AsPointer(ref MemoryMarshal.GetReference(_keyBuf.AsSpan()));

asdacap and others added 9 commits June 23, 2026 16:05
Shrink the SortedTable index and build bookkeeping ~8×:

- Store one offset per 8-record block instead of one per record. Records are now
  written physically sorted and contiguous, so a lookup binary searches the
  sparse offsets to a block, then sequentially scans its <=8 records (almost
  always within one page). The enumerator becomes a straight contiguous walk.
- Key and value sizes are a single byte each (keys are <=55 B; all inline values
  are <255 B, enforced by the builder's checked cast).
- Referenced blob-arena ids become one record each (key [0x00][id BE], value
  [0x01]) in a column that sorts before all others — so they dedup into the union
  through the normal N-way merge and iterate cheaply from the table start, with no
  list value and no chunking. Drops MergeRefIds and the ref_ids metadata entry.
- Footer carries blockSize; SortedTable format version 1->2; SnapshotCatalog
  version 2->3 (incompatible on-disk layout, dev DBs wipe-and-resync).

Full Nethermind.slnx builds; Nethermind.State.Flat.Test is green, including new
block-boundary SortedTableTests cases and the compaction/ref-id round-trips.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Each record now stores `[commonPrefix u8][suffixLen u8][keySuffix][vs u8][value]`:
only the bytes that differ from the previous key, plus the shared-prefix length.
Adjacent sorted keys share long prefixes (per-address slots share `FE<addr>FD` +
~30 slot bytes, storage/state nodes share their column/addrHash prefix, etc.), so
this is a large on-disk saving for the dominant key classes.

Block-start records (every 8th, reached via the sparse offset) force commonPrefix=0
/ full key, so each block decodes standalone. The reader binary-searches block-start
full keys then reconstructs in-block keys with a running buffer; the enumerator
reconstructs during its contiguous walk. Builder `Add`, the merger, the scanner, and
`PersistedSnapshot` are unchanged — keys flow in/out uncompressed, compression is
internal to the builder write pass / reader / enumerator.

SortedTable format version 2->3; SnapshotCatalog version 3->4 (incompatible record
layout, dev DBs wipe-and-resync). Adds a long-shared-prefix SortedTableTests case;
full Nethermind.State.Flat.Test stays green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The last caller was removed in 71a2da5 (collapse TwoByteSlot u16/u24
fork), leaving Leb128 as dead code with no remaining references.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…parator index

Replace the single-level sparse-offset SortedTable with a LevelDB-style two-level layout: size-bounded 4KB data blocks, each prefixed with a u16 restart table for in-block binary search, plus a tail separator-key index that the first-level lower-bound search operates on. Keeps the first-level search off the scattered data pages and bounds blocks by size rather than a fixed record count.

Breaking on-disk change: SortedTable format v3->v4 and catalog v4->v5 (clean break — wipe and resync). The Add/Build/TrySeek/enumerator surface is unchanged; only the internal footer/TryReadFooter shape moved. Extends SortedTableTests (multi-block, restart boundaries, gap/sentinel misses, multi-block enumeration, randomized fuzz).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rtedTable

Extract the per-block encode/decode into a reusable BlockBuilder/BlockReader used for BOTH the data blocks and the top-level index (key = separator, value = u32 block number). Data blocks are 4 KiB-aligned and addressed by block number (byte = blockNumber * 4096), so a u32 reaches a 16 TiB table; the last data block is left unpadded so small tables stay compact (the footer's lastBlockSize locates the index). Each block self-describes its offset width (2 or 4 bytes) via a leading flag, so 4 KiB data blocks use u16 offsets while the multi-MB index block uses u32. Lookups are two ceiling searches (index -> block number -> data block + exact match). Restart interval is now a builder option, default 8 (was 16).

Breaking on-disk change: SortedTable format v4->v5 and catalog v5->v6 (clean break - wipe and resync). Add BlockTests (offset-width selection + ceiling edge cases) and extend SortedTableTests (4 KiB alignment, block-number addressing, interval-8 restart boundaries).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- SmallRefCountingDisposable: reject a lease acquisition once the count reaches
  zero (not only Disposing), checked inside the retry loop too — closes the
  resurrection window between the release path's 1→0 and 0→Disposing CAS. [H1]
- PersistedSnapshotMerger.MergeSelfDestruct: skip self-destruct entries whose
  TryRead fails rather than letting the zero default read as the destructed
  marker and set a spurious truncation barrier. [H2]
- PersistedSnapshotBuilder: throw a descriptive error if a state/storage node
  disappears between extraction and the write pass, instead of an NRE on node!. [H4]
- PersistedSnapshotBuilder.WritePerAddress: return the pooled RLP buffer in a
  finally so it is not leaked on an exception inside the loop. [M1]
- SortedTable.TryReadFooter: bound numBlocks by the table size before the int
  cast / offset math so a corrupt footer cannot overflow to a negative count. [M2]
- PersistedSnapshotMerger.MergeEntries: cap the matching-sources stackalloc and
  fall back to the heap for an unusually large compaction batch. [M3]
- PersistedSnapshotCompactor.EnsureStarted: guard worker startup with a lock so
  concurrent EnqueueAsync callers cannot spawn duplicate worker sets. [M5]
- ArenaReservation.CleanUp: skip the wasted FadviseDontNeed when the file is
  about to be deleted. [L1]
- SortedTableBuilder: document the recordBase-stable-during-sort invariant. [L2]

Left intentionally: the merged self-destruct tag (review H3) stays "destructed if
any source in the range destructed". Its only value-consumer, PersistenceManager,
does `if (SelfDestructFlag is false) batch.SelfDestruct(addr)` before re-applying
the barrier-filtered post-destruct slots, so the destruct must be reported to clear
stale RocksDB storage for a re-created contract; emitting "new" would leak it. The
read path keys off the barrier (presence), not the value. Rationale documented on
MergeSelfDestruct. Moot after later refactors: M4 (Leb128 deleted), L4 (enumerator
key buffer is now fixed-size).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The persisted-snapshot metadata blob (a SortedTable) could not exceed
~2 GiB because SortedTableBuilder buffered every record in an int-capped
NativeMemoryList and sorted at Build. Replace that with a streaming,
order-enforcing builder: Add now requires strictly ascending keys and
writes each record straight into a 4 KiB data block, so the table size is
bounded by the 16 TiB data region instead of an in-memory buffer.

The single index block is no longer recomputed from the (int) block count;
it is located directly by a new i64 indexOffset in the footer and left
unaligned. Footer widened to i64 fields (count, numDataBlocks, indexOffset);
reader/enumerator widened int -> long. No multi-level/B-tree index.

Producers now emit in global ascending key order (the builder enforces it):
- PersistedSnapshotBuilder: ascending column order, storage nodes via a
  3-way merge of the fallback/compact/top sublists (path encodings are
  order-preserving, so list order matches encoded-byte order).
- PersistedSnapshotMerger: entries before metadata; metadata in name order;
  buffered slots flushed at the slot -> self-destruct/account transition so
  they land in their sorted position (MergeSelfDestruct split into
  ComputeSelfDestructBarrier + EmitSelfDestruct).

EstimateSize drops the 2 GiB Math.Min cap (EstimateMemory bounds the
metadata table, which stores only 6-byte NodeRefs, not node RLP).

Format break: SortedTable.FormatVersion 5->6, catalog v6->v7,
MetadataFormatVersion 0x05->0x06; old blobs are rejected (wipe and resync).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses review comments on IHsstByteReader / IHsstReaderSource ("who uses
this?"). They are still the core read seam, but the HSST format was removed
earlier in this PR, so the Hsst/ folder now holds only generic byte-IO seams.

- IHsstByteReader -> IByteReader, IHsstReaderSource -> IByteReaderSource.
- Move the seam types out of the misnamed Hsst/ folder: namespace + folder
  Nethermind.State.Flat.Hsst -> Nethermind.State.Flat.Io (and the test
  counterpart Test/Hsst -> Test/Io). ~22 usings updated.
- Purge the remaining "HSST" mentions from comments/docs (rephrased to
  "columnar" / "sorted-table"), and fix stale references to removed types
  (HsstReader cref, HsstBTreeBuilder, NWayMergePerAddressHsst, LeaseBlobIdsFromHsst).

Pure rename + comment cleanup, no behavior change. Full Nethermind.slnx builds;
Nethermind.State.Flat.Test green (760 passed, 4 skipped).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant