
perf: fix consensus stability, CPU saturation, and memory growth on idle chains#75

Merged
0xZunia merged 10 commits into main from perf/fix-cpu-memory-hotpaths on Mar 29, 2026
Conversation

Contributor

0xZunia commented Mar 28, 2026

Summary

Fixes critical consensus, performance, and memory issues discovered through validator log analysis on a 4-node Docker devnet. The chain previously halted permanently after ~1.5 hours due to validator set divergence, and CPU saturated at 100% on idle chains.

Before: chain halts at ~3,300 blocks, 100% CPU, ~50 MB/hour RAM growth
After: 36+ hours stable, healthy CPU, bounded RAM

Root Causes & Fixes

Consensus Stability

  • COMMIT bitmap race condition — When COMMIT quorum (3/4) was reached, the aggregate was built immediately; the 4th validator's vote (on a concurrent TCP thread) was excluded from the bitmap. Over thousands of blocks, the same validators consistently lost the race, got slashed below the inactivity threshold, and were permanently deactivated. Different leaders had different placeholder BLS key states, causing each node to slash different validators — producing divergent validator sets that could never reconverge, permanently halting the chain.
    • Re-aggregate COMMIT QC when late votes arrive after quorum
    • Accept updated COMMIT aggregates on non-leader nodes
    • Wait up to 15s for all validator identities (BLS keys) to resolve via P2P handshake before starting consensus
    • Log BLS key mismatches that were previously completely silent
    • Re-activate validators when stake recovers above minimum after slashing
    • Guard block proposals with validator set membership check
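Taken together, the bitmap fix amounts to rebuilding the aggregate whenever a new vote arrives, even after quorum has already been reached. Below is a minimal Python model of that logic only; the real implementation is C# with BLS signature aggregation and concurrent TCP threads, and all names here are illustrative:

```python
class CommitRound:
    """Hypothetical sketch of the late-vote re-aggregation fix."""
    def __init__(self, validator_count: int):
        self.n = validator_count
        self.votes: set[int] = set()
        self.quorum_reached = False
        self.bitmap: list[bool] = [False] * validator_count

    def quorum(self) -> int:
        return (2 * self.n) // 3 + 1  # e.g. 3 of 4

    def add_vote(self, idx: int) -> bool:
        """Returns True when the aggregate is (re)built: on first quorum, or
        when a late vote lands after quorum. The original bug froze the
        bitmap at quorum, permanently excluding the late voter."""
        if idx in self.votes:
            return False  # duplicate vote
        self.votes.add(idx)
        if len(self.votes) < self.quorum() and not self.quorum_reached:
            return False
        self.quorum_reached = True
        # Rebuild the bitmap from all votes seen so far instead of freezing it.
        self.bitmap = [i in self.votes for i in range(self.n)]
        return True  # caller re-broadcasts the updated COMMIT QC
```

In this model, the 4th vote arriving after the 3/4 quorum triggers a second aggregate build instead of being silently dropped, which is the behavior the PR describes.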

CPU Saturation

  • Unnecessary sync on 1-block gaps — Every block announce from 3 peers triggered TrySyncFromPeers(), each calling Fork() + ComputeStateRoot() (expensive Merkle trie rehashing). 12+ wasted state root computations per block.
  • Double message serialization — BroadcastConsensusMessage serialized for the dedup hash, then BroadcastToAll serialized again for sending.
  • Fork for already-applied blocks — Sync ApplyBatch called Fork() even when all blocks were already finalized by consensus.
  • TrieStateDb dirty tracking never cleared — _dirtyStorageKeys and _dirtyAccounts on the TrieStateDb layer grew unboundedly (the FlatStateDb cleanup never forwarded to the inner trie), causing escalating GC pressure → 100% CPU after hours.
  • Error handler spin risks — ReconnectLoop and TCP AcceptLoop had catch blocks that continued without delay on persistent errors.
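The spin-guard fix in the last bullet reduces to a retry loop that sleeps in its catch path instead of immediately re-entering the failing call. A hedged Python sketch of that shape, with illustrative names rather than the project's actual ReconnectLoop/AcceptLoop code:

```python
import time

def accept_loop(accept_once, stop_flag, error_delay=1.0):
    """Sketch of the spin-guard fix: the original loop's catch block
    continued immediately on a persistent socket error, burning a core.
    Sleeping in the error path bounds the retry rate."""
    errors = 0
    while not stop_flag():
        try:
            accept_once()
            errors = 0  # healthy accept resets the error streak
        except OSError:
            errors += 1
            time.sleep(error_delay)  # the fix: no tight spin on repeated failures
    return errors
```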

Memory Growth

  • RocksDB unbounded block cache — No block cache limit was configured; native memory grew ~50 MB/hour as SST index/filter blocks scaled with on-disk data. Added 64 MB shared LRU cache with CacheIndexAndFilterBlocks.
  • FlatStateDb storage cache unbounded — _storageCache grew monotonically with no eviction. Added 500K entry hard cap with full eviction.
  • Sync path missing dirty tracking cleanup — ApplyBatch never called ClearDirtyTracking()/CompactDeletedSets() on the forked state before swap.
  • PruneInactivePeers never called — Disconnected peers accumulated forever.
  • ProcessUnbonding never called — Unbonding queue grew with every unstake.
  • SlashingEngine history unbounded — Capped at 10K entries.
  • EpisubService timer crash risk — Unhandled exception in cleanup callback would terminate the process.
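The FlatStateDb cap described above uses a hard entry limit with full eviction, which is simpler than LRU bookkeeping on a hot path. A minimal Python model of that policy; the 500K figure mirrors the PR, everything else (names, types) is illustrative:

```python
class BoundedStorageCache:
    """Sketch of a hard-capped cache with full eviction at the cap."""
    def __init__(self, max_entries=500_000):
        self.max_entries = max_entries
        self._cache: dict = {}

    def put(self, key, value):
        # Full eviction: once the cap is hit, drop everything rather than
        # tracking per-entry recency. Bounded worst-case memory, O(1) puts.
        if len(self._cache) >= self.max_entries and key not in self._cache:
            self._cache.clear()
        self._cache[key] = value

    def get(self, key):
        return self._cache.get(key)
```

The trade-off is a periodic cold cache instead of unbounded growth, which is the right direction when the failure mode is monotonic RAM creep.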

Test plan

  • dotnet build — 0 warnings, 0 errors
  • dotnet test — all tests pass across 16 projects
  • 4-validator Docker devnet: 36+ hours uptime, stable consensus
  • First epoch transition (block 100): all 4 validators survive, quorum 3/3
  • CPU stable (no 100% saturation)
  • RAM bounded (no 50 MB/hour growth)
  • No "Vote from non-validator" or "Invalid aggregate signature" warnings
  • No "BLS key mismatch" warnings (identity wait works)

0xZunia added 9 commits March 23, 2026 12:50
ComputeStateRoot was unconditionally rewriting all 9 system contract
accounts into the world trie on every call — even with zero state changes.
Each rewrite triggered BLAKE3 rehashing of the entire trie path. Combined
with MerklePatriciaTrie.Put never short-circuiting on identical values,
this caused ~50+ unnecessary hash computations per block across all
validators, explaining the 100% CPU with zero transactions.

CPU fixes:
- Skip unchanged StorageRoot in ComputeStateRoot (TrieStateDb)
- Short-circuit identical leaf values in trie Put (MerklePatriciaTrie)
- Cache BLS public key at startup in both consensus modes
- Skip TWAP carry-forward and limit order matching when 0 DEX pools exist
- Add 1s cooldown to fire-and-forget sync in pipelined consensus loop
- Cache ConnectedPeers/ConnectedCount in PeerManager (O(1) vs O(n) LINQ)
- Cap RocksDB background threads to prevent CPU saturation on small VMs
- Add SpinWait to CAS loops in PeerInfo and ResourceLimiter

Memory fixes:
- Clear BlockData + signature lists in CleanupFinalizedRounds
- Remove .ToArray() from _proposalsByView cleanup (O(n) alloc per block)
- Reduce Episub seen-message cache 200K→50K, TTL 120s→60s
- Add proactive 30s cleanup timer to Episub (was only reactive on overflow)
- FlatStateDb: add ClearDirtyTracking() and CompactDeletedSets() to
  prevent _dirtyStorageKeys, _dirtyAccounts, _deletedStorage, and
  _deletedAccounts from growing unboundedly on the canonical instance.
  Called after each block finalization in BlockApplier.

- TrieStateDb: cache ComputeStateRoot() result and skip recomputation
  when no writes have occurred since last call. Clear _storageTries
  after flushing storage roots to release accumulated trie node refs.

- BlockBuilder: early return in ApplyDexSettlement when pool count is 0,
  avoiding TWAP/order operations and allocations on every empty block.

- Mempool: early return with static empty list when no pending txs or
  DEX intents, avoiding container allocations on every block proposal.

- WebSocketHandler: skip broadcast serialization when no clients are
  connected.

- RocksDbStore: reduce write buffer sizes (~1.3GB → ~320MB total) and
  scale compaction threads to core count instead of hardcoded 2+1.
…scade

Fix COMMIT aggregate bitmap race condition that caused systematic
inactivity slashing of validators. When COMMIT quorum (3/4) was reached,
the aggregate was built immediately — the 4th validator's vote, being
processed on a concurrent TCP thread, was excluded from the bitmap.
Over thousands of blocks, the same validators consistently lost the
race due to thread scheduling, dropping below the 50% inactivity
threshold and getting slashed every epoch until deactivated.

- Re-aggregate COMMIT QC when late votes arrive after quorum
- Accept updated COMMIT aggregates on non-leader nodes
- Handle bitmap-only updates for already-finalized blocks
- Re-activate validators when stake recovers above minimum
- Guard block proposals with validator set membership check
- Add storage cache eviction at 500K entries (was unbounded)
- Clean dirty/deleted tracking on sync path (was consensus-only)
- Call PruneInactivePeers in ReconnectLoop (was never called)
- Drain unbonding queue per finalized block (was never called)
- Cap slashing history at 10K entries (was unbounded)
- Throttle RPC sync polling with 500ms minimum delay
The root cause of validator set divergence (and permanent chain halt) is
that consensus starts before all P2P handshakes complete. Leaders whose
validator set still has placeholder BLS keys silently drop votes from
unresolved peers — no log, no warning. Different leaders have different
placeholder states (depending on Docker container startup order), so
different blocks end up with different commit bitmaps. At epoch boundary,
each node computes different inactivity slashing, producing divergent
validator sets that can never reconverge.

- Wait up to 15s for all validator identities to resolve via handshake
  before entering the consensus loop
- Add warning log when votes are dropped due to BLS key mismatch
  (previously completely silent — made diagnosis impossible)
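The identity wait reduces to polling with a deadline. A minimal Python sketch of that shape; the real code waits on P2P handshake resolution in C#, and all names here are assumptions:

```python
import time

def wait_for_validator_identities(resolved_count, expected,
                                  timeout_s=15.0, poll_s=0.1):
    """Sketch of the 15s identity wait: poll until every validator's BLS key
    has resolved via handshake, or give up at the deadline so consensus can
    start anyway (with the gap logged by the caller)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if resolved_count() >= expected:
            return True
        time.sleep(poll_s)
    return False
```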
…hains

Block announce messages from peers trigger TrySyncFromPeers even for
1-block gaps. With 3 peers broadcasting every finalized block, this
causes ~3 Fork+ComputeStateRoot cycles per block per node — each
involving expensive Merkle trie rehashing, TCP round-trips, and fork
state creation. By the time sync fetches the block, consensus has
already finalized it, so the work is entirely wasted.

On a 4-validator devnet, this wastes ~12 unnecessary state root
computations per block, saturating CPU at 100% on idle chains.

- Only trigger sync from block announces when gap > 1 block
- Only trigger sync from OnBehindDetected when gap > 1 block
- 1-block gaps resolve naturally via consensus (~1s latency)
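The gating rule itself is a one-liner: sync only when the announced height is more than one block ahead. A Python sketch with illustrative names:

```python
def should_sync(local_height: int, announced_height: int) -> bool:
    """Sketch of the announce gating: a 1-block gap is left to resolve
    through normal consensus (~1s), so only larger gaps trigger the
    expensive TrySyncFromPeers path."""
    return announced_height - local_height > 1
```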
BroadcastConsensusMessage serialized the message once for dedup hashing,
then BroadcastToAll serialized it again for sending. With 3 aggregate QCs
per block × 4 validators, this wasted 12 serializations per block — each
involving MessageCodec encoding and byte[] allocation.

- Add BroadcastRawToAll that takes pre-serialized bytes
- Reuse the dedup-serialized bytes for the actual broadcast
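The serialize-once pattern is: encode, hash the encoded bytes for dedup, then broadcast those same bytes. A Python model with illustrative names; the real code uses the project's MessageCodec in C#, and the hash function here (blake2b) is just a stand-in:

```python
import hashlib

def broadcast_consensus_message(encode, msg, seen: set, send_raw):
    """Sketch of the single-serialization fix: one encode feeds both the
    dedup hash and the raw broadcast (the BroadcastRawToAll analogue)."""
    raw = encode(msg)                       # serialize exactly once
    digest = hashlib.blake2b(raw).digest()  # dedup hash over the same bytes
    if digest in seen:
        return False
    seen.add(digest)
    send_raw(raw)                           # no second serialization
    return True
```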
…pplied

When consensus finalizes a block before the sync response arrives,
ApplyBatch still called stateDbRef.Fork() — which triggers
TrieStateDb.ComputeStateRoot() with full Merkle trie rehashing — only
to discover all blocks are already applied and discard the fork.

On a 4-validator idle chain, this wasted ~3 state root computations per
block per node (one per sync attempt from each peer).

- Add fast path in ApplyBatch: check if all blocks are already in the
  chain before forking. If so, just record epoch data and return.
- Add early-exit in Mempool.PruneStale when mempool is empty, avoiding
  lock acquisition, list allocation, and trie reads on every block.
TrieStateDb._dirtyStorageKeys and _dirtyAccounts were NEVER cleared.
The interface default ClearDirtyTracking() was a no-op, and FlatStateDb
only cleared its own sets without forwarding to the inner TrieStateDb.

Every block added entries: SetStorage → _dirtyStorageKeys.Add(),
SetAccount → _dirtyAccounts.Add(). Over hours, these HashSets grew
unboundedly (3,600+ entries/hour on idle chains), creating GC pressure
that escalated to 100% CPU as Gen 2 collections became frequent.

- Add ClearDirtyTracking() override to TrieStateDb
- Forward FlatStateDb.ClearDirtyTracking() to inner TrieStateDb
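The forwarding fix is the essential point: the outer layer's cleanup must delegate to the inner layer it wraps. A Python model of the two-layer shape, with names loosely mirroring the PR but otherwise illustrative:

```python
class TrieStateDbModel:
    """Inner layer: tracks dirty keys so state-root computation knows what changed."""
    def __init__(self):
        self.dirty_storage_keys: set = set()
        self.dirty_accounts: set = set()

    def set_storage(self, key, value):
        self.dirty_storage_keys.add(key)

    def clear_dirty_tracking(self):
        self.dirty_storage_keys.clear()
        self.dirty_accounts.clear()

class FlatStateDbModel:
    """Outer layer. The bug: cleanup cleared only its own sets and never
    forwarded to the wrapped inner trie, so the inner sets grew forever."""
    def __init__(self, inner: TrieStateDbModel):
        self.inner = inner
        self.dirty: set = set()

    def clear_dirty_tracking(self):
        self.dirty.clear()
        self.inner.clear_dirty_tracking()  # the fix: forward to the inner layer
```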
RocksDB native memory grew ~50 MB/hour because no block cache limit was
set. Each column family created its own unbounded cache, and as SST files
accumulated (trie nodes are never pruned), index/filter blocks consumed
increasing amounts of native memory outside the .NET heap.

- Add shared 64 MB LRU block cache across all 8 column families
- Enable CacheIndexAndFilterBlocks to keep index/filter in the bounded
  cache instead of unbounded native allocations
- Add delay in ReconnectLoop and AcceptLoop error handlers to prevent
  tight spin on persistent socket errors
- Add catch in EpisubService timer callback to prevent unhandled
  ThreadPool exception from terminating the process

Copilot AI left a comment


Pull request overview

This PR addresses multiple validator-reported issues causing consensus divergence/halts, idle-chain CPU saturation, and unbounded memory growth by improving COMMIT QC aggregation behavior, reducing unnecessary expensive state work, and adding cache/collection bounds plus cleanup paths across node/network/storage components.

Changes:

  • Fixes consensus stability around COMMIT bitmap races (re-aggregating/accepting updated COMMIT QCs, waiting for validator identities before consensus start, and guarding proposal eligibility).
  • Reduces idle-chain CPU load by avoiding redundant sync/fork/state-root work and cutting repeated allocations/serialization in gossip, peer management, and block production paths.
  • Bounds memory growth via RocksDB cache configuration, state DB dirty/tombstone compaction, gossip cache caps + cleanup, and other long-lived collection limits.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
tests/Basalt.Consensus.Tests/BasaltBftTests.cs Updates consensus test expectations to allow updated COMMIT aggregates.
src/storage/Basalt.Storage/TrieStateDb.cs Adds cached state root + clears storage trie cache; introduces dirty tracking cleanup.
src/storage/Basalt.Storage/Trie/MerklePatriciaTrie.cs Skips rehash/write when leaf value is unchanged.
src/storage/Basalt.Storage/StateDbRef.cs Forwards new state-db maintenance APIs (dirty tracking / deletion compaction).
src/storage/Basalt.Storage/RocksDb/RocksDbStore.cs Tunes RocksDB options and introduces shared LRU block cache.
src/storage/Basalt.Storage/IStateDatabase.cs Adds default no-op APIs for clearing dirty tracking and compacting deletion sets.
src/storage/Basalt.Storage/FlatStateDb.cs Adds eviction/cleanup paths to bound caches and tracking sets.
src/node/Basalt.Node/NodeCoordinator.cs Adds identity-wait before consensus, sync cooldowns, >1-block gap sync gating, proposal membership guard, proposal pruning iteration optimizations, peer pruning, and updated-bitmap handling.
src/node/Basalt.Node/BlockSyncService.cs Adds minimum polling delay to reduce sync pressure.
src/node/Basalt.Node/BlockApplier.cs Adds canonical/fork cleanup hooks, sync fast-path for already-applied blocks, and unbonding processing.
src/network/Basalt.Network/Transport/TcpTransport.cs Adds delay on accept-loop socket errors to avoid tight spins.
src/network/Basalt.Network/PeerManager.cs Adds cached connected-peer snapshot + count to reduce repeated LINQ/allocation overhead.
src/network/Basalt.Network/PeerInfo.cs Adds SpinWait in CAS loop to reduce CPU burn under contention.
src/network/Basalt.Network/GossipService.cs Avoids double-serialization by broadcasting pre-serialized bytes.
src/network/Basalt.Network/Gossip/EpisubService.cs Tightens caps/TTLs and adds proactive timer-based cleanup plus tier broadcast optimizations.
src/execution/Basalt.Execution/VM/Sandbox/ResourceLimiter.cs Adds SpinWait in CAS loop to reduce CPU burn under contention.
src/execution/Basalt.Execution/Mempool.cs Adds empty fast paths for pending retrieval and pruning.
src/execution/Basalt.Execution/BlockBuilder.cs Skips DEX work when no pools exist and reuses DEX state where possible.
src/consensus/Basalt.Consensus/Staking/StakingState.cs Re-activates validators when stake recovers above minimum.
src/consensus/Basalt.Consensus/Staking/SlashingEngine.cs Caps in-memory slashing history.
src/consensus/Basalt.Consensus/PipelinedConsensus.cs Caches BLS public key and cleans up finalized rounds to release memory sooner.
src/consensus/Basalt.Consensus/BasaltBft.cs Adds COMMIT bitmap update logic, caches BLS public key, and logs identity/key mismatches.
src/api/Basalt.Api.Rest/WebSocketEndpoint.cs Skips per-block WS broadcast allocations when no clients are connected.


Comment on lines +571 to +580
private static readonly List<TransactionReceipt> s_emptyReceipts = [];

public List<TransactionReceipt> ApplyDexSettlement(IStateDatabase stateDb, BlockHeader blockHeader)
{
    var allReceipts = new List<TransactionReceipt>();
    // Fast path: skip all DEX operations when no pools exist.
    // On an idle chain this avoids allocating DexState objects, reading pool state,
    // and running TWAP/order logic on every block.
    var poolCheck = new DexState(stateDb);
    if (poolCheck.GetPoolCount() == 0)
        return s_emptyReceipts;

Copilot AI Mar 28, 2026


Returning a shared mutable list (s_emptyReceipts) is risky for the same reason as the mempool empty-list fast path: if any caller mutates the returned List, it will affect all other callers. Consider returning a new empty List each time, or change the return type to IReadOnlyList and return Array.Empty()/[] for empty.

Comment on lines +55 to +66
// Shared LRU block cache across all column families.
// Without this, RocksDB creates an unbounded per-CF cache whose native memory
// grows proportionally to on-disk data (SST index/filter blocks, data blocks).
var blockCache = RocksDbSharp.Native.Instance.rocksdb_cache_create_lru(new UIntPtr(64 * 1024 * 1024)); // 64 MB shared

var defaultTableOptions = new BlockBasedTableOptions()
    .SetBlockCache(blockCache)
    .SetCacheIndexAndFilterBlocks(true);

var defaultOptions = new ColumnFamilyOptions()
    .SetBlockBasedTableFactory(defaultTableOptions);


Copilot AI Mar 28, 2026


The LRU block cache is created via rocksdb_cache_create_lru but never destroyed in Dispose(). Since this is native memory, repeated RocksDbStore instances (tests, multiple nodes in-process, etc.) can leak until process exit. Consider storing the cache handle in a field and calling rocksdb_cache_destroy in Dispose(), or using a RocksDbSharp-managed cache wrapper if available.

Comment on lines +69 to +83
// Proactive TTL cleanup every 30 seconds — prevents unbounded cache growth
// even when message rate stays below the reactive cap threshold.
_cleanupTimer = new Timer(_ =>
{
    if (Interlocked.CompareExchange(ref _cleanupRunning, 1, 0) != 0)
        return;
    try { CleanupSeenMessages(); }
    catch (Exception ex)
    {
        // Swallow to prevent process termination from unhandled ThreadPool exception.
        // Timer will retry in 30 seconds.
        _logger.LogWarning(ex, "Error during Episub message cache cleanup");
    }
    finally { Interlocked.Exchange(ref _cleanupRunning, 0); }
}, null, TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(30));

Copilot AI Mar 28, 2026


_cleanupTimer is created but never disposed/stopped. Since NodeCoordinator.StopAsync doesn’t dispose EpisubService, this timer will keep the service (and its dictionaries/logger) alive and continue running after shutdown, which can cause background work after stop and resource leaks. Consider implementing IDisposable/IAsyncDisposable on EpisubService to dispose the timer, and have NodeCoordinator.StopAsync dispose EpisubService (or expose a Stop method).

// the commit bitmap. Different leaders may have different placeholder states, leading
// to divergent inactivity slashing calculations at epoch boundaries — a fatal
// consistency violation that permanently splits the network.
var expectedPeers = _config.Peers.Length;

Copilot AI Mar 28, 2026


expectedPeers is declared but never used, which should trigger a compiler warning and adds noise to the startup logic. Remove it, or use it in the identity-resolution logging/threshold if that was the intent.

Suggested change
var expectedPeers = _config.Peers.Length;

Comment on lines +177 to +186
private static readonly List<Transaction> s_emptyTxList = [];

public List<Transaction> GetPending(int maxCount, IStateDatabase? stateDb = null)
{
    lock (_lock)
    {
        // Fast path: empty mempool — avoid all allocations
        if (_orderedEntries.Count == 0)
            return s_emptyTxList;


Copilot AI Mar 28, 2026


Returning a shared mutable List (s_emptyTxList) is unsafe: callers can modify it (Add/Clear/etc.) and that mutation will be observed by all future callers, potentially corrupting mempool behavior across threads. Prefer returning a new empty List each time, or switch the API to return IReadOnlyList/IReadOnlyCollection and return Array.Empty()/[] for the empty case.

Comment on lines +388 to +391
// Fast path: skip when mempool is empty (common on idle chains).
// Avoids lock acquisition, list allocation, and trie reads.
if (_transactions.Count == 0 && _dexIntentTransactions.Count == 0)
return 0;

Copilot AI Mar 28, 2026


This "fast path" reads _transactions.Count / _dexIntentTransactions.Count outside the _lock, but those dictionaries are mutated under the same lock elsewhere. Concurrent reads/writes on Dictionary are not thread-safe and can lead to data races or undefined behavior. Move the emptiness check inside the existing lock (or track counts via atomic fields) to keep the mempool’s synchronization consistent.

…e state

- Remove shared mutable static empty lists in BlockBuilder and Mempool
  that could be corrupted if any caller mutated the returned List
- Store RocksDB LRU block cache handle and destroy in Dispose() to
  prevent native memory leak across store instances
- Make EpisubService IDisposable to stop cleanup timer on shutdown,
  dispose from NodeCoordinator.StopAsync
- Move Mempool.PruneStale emptiness check inside lock to fix data race
  on Dictionary.Count reads concurrent with mutations
- Remove unused expectedPeers variable in RunConsensusLoop
@0xZunia 0xZunia merged commit 3b06236 into main Mar 29, 2026
5 checks passed
