perf: fix consensus stability, CPU saturation, and memory growth on idle chains#75
Conversation
ComputeStateRoot was unconditionally rewriting all 9 system contract accounts into the world trie on every call — even with zero state changes. Each rewrite triggered BLAKE3 rehashing of the entire trie path. Combined with MerklePatriciaTrie.Put never short-circuiting on identical values, this caused ~50+ unnecessary hash computations per block across all validators, explaining the 100% CPU with zero transactions.

CPU fixes:
- Skip unchanged StorageRoot in ComputeStateRoot (TrieStateDb)
- Short-circuit identical leaf values in trie Put (MerklePatriciaTrie)
- Cache BLS public key at startup in both consensus modes
- Skip TWAP carry-forward and limit order matching when 0 DEX pools exist
- Add 1s cooldown to fire-and-forget sync in pipelined consensus loop
- Cache ConnectedPeers/ConnectedCount in PeerManager (O(1) vs O(n) LINQ)
- Cap RocksDB background threads to prevent CPU saturation on small VMs
- Add SpinWait to CAS loops in PeerInfo and ResourceLimiter

Memory fixes:
- Clear BlockData + signature lists in CleanupFinalizedRounds
- Remove .ToArray() from _proposalsByView cleanup (O(n) alloc per block)
- Reduce Episub seen-message cache 200K→50K, TTL 120s→60s
- Add proactive 30s cleanup timer to Episub (was only reactive on overflow)
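The trie-level short-circuit can be illustrated with a toy, map-backed trie. This is a minimal sketch: the real MerklePatriciaTrie walks nibble paths and BLAKE3-rehashes every node on the path, and all names here are illustrative, not the actual implementation.

```csharp
using System;
using System.Collections.Generic;

// Toy sketch of the Put short-circuit: compare against the existing leaf
// value before touching the trie at all. Skipping the write also skips the
// (expensive) rehash of every node on the leaf-to-root path.
public sealed class TinyTrie
{
    private readonly Dictionary<string, byte[]> _leaves = new();
    public int RehashCount; // stands in for the BLAKE3 path-rehash cost

    public void Put(string key, byte[] value)
    {
        if (_leaves.TryGetValue(key, out var existing) &&
            existing.AsSpan().SequenceEqual(value))
            return; // identical value: no node rewrite, no path rehash

        _leaves[key] = value;
        RehashCount++; // real code: rehash leaf → root with BLAKE3
    }
}
```

Writing the same key/value twice would leave `RehashCount` at 1 under this guard, which is the whole saving on idle chains where system accounts never change.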
- FlatStateDb: add ClearDirtyTracking() and CompactDeletedSets() to prevent _dirtyStorageKeys, _dirtyAccounts, _deletedStorage, and _deletedAccounts from growing unboundedly on the canonical instance. Called after each block finalization in BlockApplier.
- TrieStateDb: cache ComputeStateRoot() result and skip recomputation when no writes have occurred since last call. Clear _storageTries after flushing storage roots to release accumulated trie node refs.
- BlockBuilder: early return in ApplyDexSettlement when pool count is 0, avoiding TWAP/order operations and allocations on every empty block.
- Mempool: early return with static empty list when no pending txs or DEX intents, avoiding container allocations on every block proposal.
- WebSocketHandler: skip broadcast serialization when no clients are connected.
- RocksDbStore: reduce write buffer sizes (~1.3GB → ~320MB total) and scale compaction threads to core count instead of hardcoded 2+1.
…scade

Fix COMMIT aggregate bitmap race condition that caused systematic inactivity slashing of validators. When COMMIT quorum (3/4) was reached, the aggregate was built immediately — the 4th validator's vote, being processed on a concurrent TCP thread, was excluded from the bitmap. Over thousands of blocks, the same validators consistently lost the race due to thread scheduling, dropping below the 50% inactivity threshold and getting slashed every epoch until deactivated.

- Re-aggregate COMMIT QC when late votes arrive after quorum
- Accept updated COMMIT aggregates on non-leader nodes
- Handle bitmap-only updates for already-finalized blocks
- Re-activate validators when stake recovers above minimum
- Guard block proposals with validator set membership check
- Add storage cache eviction at 500K entries (was unbounded)
- Clean dirty/deleted tracking on sync path (was consensus-only)
- Call PruneInactivePeers in ReconnectLoop (was never called)
- Drain unbonding queue per finalized block (was never called)
- Cap slashing history at 10K entries (was unbounded)
- Throttle RPC sync polling with 500ms minimum delay
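The re-aggregation fix can be sketched as follows. This is a hedged illustration, not the PR's actual code: `CommitVote`, `RoundState`, `BlsAggregate`, `QuorumSize`, and `Broadcast` are assumed stand-ins for the real consensus types.

```csharp
// Sketch: when a COMMIT vote lands after quorum, rebuild the aggregate so the
// participation bitmap reflects every vote seen — not just the first 3/4 that
// happened to win the thread-scheduling race.
private void OnCommitVote(CommitVote vote, RoundState round)
{
    round.CommitVotes.Add(vote);
    if (round.CommitVotes.Count < QuorumSize)
        return; // not yet at quorum

    // At quorum, or a late vote after quorum: aggregate over ALL votes so far.
    var qc = BlsAggregate(round.CommitVotes); // signature + participation bitmap

    // Only replace the QC when the new bitmap covers strictly more validators.
    // Non-leaders accept such bitmap-only updates even for finalized blocks.
    if (round.CommitQc is null ||
        qc.ParticipantCount > round.CommitQc.ParticipantCount)
    {
        round.CommitQc = qc;
        Broadcast(qc);
    }
}
```

The key design point is that the aggregate is a pure function of the vote set, so rebuilding it on late arrivals is safe and makes the bitmap independent of TCP thread scheduling.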
The root cause of validator set divergence (and permanent chain halt) is that consensus starts before all P2P handshakes complete. Leaders whose validator set still has placeholder BLS keys silently drop votes from unresolved peers — no log, no warning. Different leaders have different placeholder states (depending on Docker container startup order), so different blocks end up with different commit bitmaps. At epoch boundary, each node computes different inactivity slashing, producing divergent validator sets that can never reconverge.

- Wait up to 15s for all validator identities to resolve via handshake before entering the consensus loop
- Add warning log when votes are dropped due to BLS key mismatch (previously completely silent — made diagnosis impossible)
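The startup gate can be sketched like this (a minimal sketch; `AllIdentitiesResolved()` is a hypothetical check over the configured validator set, and the polling interval is an assumption):

```csharp
// Sketch: block consensus startup until every validator's real BLS key has
// been learned via handshake, capped at 15 seconds so a dead peer cannot
// stall the node forever.
var deadline = DateTime.UtcNow + TimeSpan.FromSeconds(15);
while (!AllIdentitiesResolved() && DateTime.UtcNow < deadline)
    await Task.Delay(250, cancellationToken); // poll; handshakes run concurrently

if (!AllIdentitiesResolved())
    _logger.LogWarning(
        "Entering consensus with placeholder BLS keys for some validators; " +
        "their votes will be dropped until their handshakes complete");
```

The timeout-plus-warning shape matters: waiting forever would trade a divergence bug for a liveness bug, while the warning removes the silent-drop failure mode described above.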
…hains

Block announce messages from peers trigger TrySyncFromPeers even for 1-block gaps. With 3 peers broadcasting every finalized block, this causes ~3 Fork+ComputeStateRoot cycles per block per node — each involving expensive Merkle trie rehashing, TCP round-trips, and fork state creation. By the time sync fetches the block, consensus has already finalized it, so the work is entirely wasted. On a 4-validator devnet, this wastes ~12 unnecessary state root computations per block, saturating CPU at 100% on idle chains.

- Only trigger sync from block announces when gap > 1 block
- Only trigger sync from OnBehindDetected when gap > 1 block
- 1-block gaps resolve naturally via consensus (~1s latency)
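The gating logic amounts to a one-line guard. A sketch, with `_chain.TipHeight` and the handler name assumed from the PR's descriptions:

```csharp
// Sketch: only fall back to peer sync for multi-block gaps. A 1-block gap is
// about to be closed by the in-flight consensus round anyway, so syncing it
// would just burn a Fork() + ComputeStateRoot() cycle per announcing peer.
private void OnBlockAnnounce(ulong announcedHeight)
{
    ulong tip = _chain.TipHeight;
    if (announcedHeight > tip + 1)
        TrySyncFromPeers(); // genuinely behind: fetch the missing range
    // announcedHeight <= tip + 1: do nothing — consensus finalizes it (~1s)
}
```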
BroadcastConsensusMessage serialized the message once for dedup hashing, then BroadcastToAll serialized it again for sending. With 3 aggregate QCs per block × 4 validators, this wasted 12 serializations per block — each involving MessageCodec encoding and byte[] allocation.

- Add BroadcastRawToAll that takes pre-serialized bytes
- Reuse the dedup-serialized bytes for the actual broadcast
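The reuse pattern looks roughly like this. `BroadcastRawToAll` and `MessageCodec` follow the PR's naming; the dedup hash helper and seen-set are assumptions for the sketch:

```csharp
// Sketch: encode once, then use the same byte[] for both the dedup hash and
// the actual send — eliminating the second MessageCodec pass and allocation.
public void BroadcastConsensusMessage(ConsensusMessage msg)
{
    byte[] encoded = MessageCodec.Encode(msg); // single serialization
    var dedupKey = ComputeDedupHash(encoded);  // hypothetical hash helper
    if (!_seenMessages.Add(dedupKey))
        return;                                // duplicate: drop before sending

    BroadcastRawToAll(encoded);                // reuse bytes — no re-encode
}
```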
…pplied

When consensus finalizes a block before the sync response arrives, ApplyBatch still called stateDbRef.Fork() — which triggers TrieStateDb.ComputeStateRoot() with full Merkle trie rehashing — only to discover all blocks are already applied and discard the fork. On a 4-validator idle chain, this wasted ~3 state root computations per block per node (one per sync attempt from each peer).

- Add fast path in ApplyBatch: check if all blocks are already in the chain before forking. If so, just record epoch data and return.
- Add early-exit in Mempool.PruneStale when mempool is empty, avoiding lock acquisition, list allocation, and trie reads on every block.
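The ApplyBatch fast path can be sketched as a containment check before the fork. `_chain`, `RecordEpochData`, and `ApplyForked` are assumed stand-ins for the real BlockApplier dependencies:

```csharp
// Sketch: if consensus already finalized every block in the batch, skip the
// Fork() entirely — and with it the full Merkle trie rehash it triggers.
public bool ApplyBatch(IReadOnlyList<Block> blocks)
{
    if (blocks.All(b => _chain.ContainsBlock(b.Hash)))
    {
        RecordEpochData(blocks); // still capture epoch bookkeeping
        return true;             // no fork created, no state root computed
    }

    var fork = _stateDbRef.Fork();   // slow path: unchanged from before
    return ApplyForked(fork, blocks);
}
```

Ordering the cheap check before the expensive fork is the entire fix; the slow path is untouched for the case where the node is genuinely behind.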
TrieStateDb._dirtyStorageKeys and _dirtyAccounts were NEVER cleared. The interface default ClearDirtyTracking() was a no-op, and FlatStateDb only cleared its own sets without forwarding to the inner TrieStateDb. Every block added entries: SetStorage → _dirtyStorageKeys.Add(), SetAccount → _dirtyAccounts.Add(). Over hours, these HashSets grew unboundedly (3,600+ entries/hour on idle chains), creating GC pressure that escalated to 100% CPU as Gen 2 collections became frequent.

- Add ClearDirtyTracking() override to TrieStateDb
- Forward FlatStateDb.ClearDirtyTracking() to inner TrieStateDb
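The override-plus-forwarding shape can be sketched as below. Field and type names follow the PR's descriptions, but the member shapes are assumptions:

```csharp
using System.Collections.Generic;

// Sketch of the two-part fix: TrieStateDb overrides the interface's no-op
// default, and FlatStateDb forwards to the inner layer so BOTH sets of
// tracking are cleared after each block finalization.
public interface IStateDatabase
{
    void ClearDirtyTracking() { } // default no-op — the original bug
}

public sealed class TrieStateDb : IStateDatabase
{
    private readonly HashSet<byte[]> _dirtyStorageKeys = new();
    private readonly HashSet<byte[]> _dirtyAccounts = new();

    public void ClearDirtyTracking()
    {
        _dirtyStorageKeys.Clear(); // previously grew 3,600+ entries/hour idle
        _dirtyAccounts.Clear();
    }
}

public sealed class FlatStateDb : IStateDatabase
{
    private readonly HashSet<byte[]> _dirtyAccounts = new();
    private readonly IStateDatabase _inner;
    public FlatStateDb(IStateDatabase inner) => _inner = inner;

    public void ClearDirtyTracking()
    {
        _dirtyAccounts.Clear();
        _inner.ClearDirtyTracking(); // the forwarding that was missing
    }
}
```

A default interface method is easy to forget to override, which is exactly how the no-op survived unnoticed; making the inner forwarding explicit closes the gap for any future wrapper, too.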
RocksDB native memory grew ~50 MB/hour because no block cache limit was set. Each column family created its own unbounded cache, and as SST files accumulated (trie nodes are never pruned), index/filter blocks consumed increasing amounts of native memory outside the .NET heap.

- Add shared 64 MB LRU block cache across all 8 column families
- Enable CacheIndexAndFilterBlocks to keep index/filter in the bounded cache instead of unbounded native allocations
- Add delay in ReconnectLoop and AcceptLoop error handlers to prevent tight spin on persistent socket errors
- Add catch in EpisubService timer callback to prevent unhandled ThreadPool exception from terminating the process
Pull request overview
This PR addresses multiple validator-reported issues causing consensus divergence/halts, idle-chain CPU saturation, and unbounded memory growth by improving COMMIT QC aggregation behavior, reducing unnecessary expensive state work, and adding cache/collection bounds plus cleanup paths across node/network/storage components.
Changes:
- Fixes consensus stability around COMMIT bitmap races (re-aggregating/accepting updated COMMIT QCs, waiting for validator identities before consensus start, and guarding proposal eligibility).
- Reduces idle-chain CPU load by avoiding redundant sync/fork/state-root work and cutting repeated allocations/serialization in gossip, peer management, and block production paths.
- Bounds memory growth via RocksDB cache configuration, state DB dirty/tombstone compaction, gossip cache caps + cleanup, and other long-lived collection limits.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/Basalt.Consensus.Tests/BasaltBftTests.cs | Updates consensus test expectations to allow updated COMMIT aggregates. |
| src/storage/Basalt.Storage/TrieStateDb.cs | Adds cached state root + clears storage trie cache; introduces dirty tracking cleanup. |
| src/storage/Basalt.Storage/Trie/MerklePatriciaTrie.cs | Skips rehash/write when leaf value is unchanged. |
| src/storage/Basalt.Storage/StateDbRef.cs | Forwards new state-db maintenance APIs (dirty tracking / deletion compaction). |
| src/storage/Basalt.Storage/RocksDb/RocksDbStore.cs | Tunes RocksDB options and introduces shared LRU block cache. |
| src/storage/Basalt.Storage/IStateDatabase.cs | Adds default no-op APIs for clearing dirty tracking and compacting deletion sets. |
| src/storage/Basalt.Storage/FlatStateDb.cs | Adds eviction/cleanup paths to bound caches and tracking sets. |
| src/node/Basalt.Node/NodeCoordinator.cs | Adds identity-wait before consensus, sync cooldowns, >1-block gap sync gating, proposal membership guard, proposal pruning iteration optimizations, peer pruning, and updated-bitmap handling. |
| src/node/Basalt.Node/BlockSyncService.cs | Adds minimum polling delay to reduce sync pressure. |
| src/node/Basalt.Node/BlockApplier.cs | Adds canonical/fork cleanup hooks, sync fast-path for already-applied blocks, and unbonding processing. |
| src/network/Basalt.Network/Transport/TcpTransport.cs | Adds delay on accept-loop socket errors to avoid tight spins. |
| src/network/Basalt.Network/PeerManager.cs | Adds cached connected-peer snapshot + count to reduce repeated LINQ/allocation overhead. |
| src/network/Basalt.Network/PeerInfo.cs | Adds SpinWait in CAS loop to reduce CPU burn under contention. |
| src/network/Basalt.Network/GossipService.cs | Avoids double-serialization by broadcasting pre-serialized bytes. |
| src/network/Basalt.Network/Gossip/EpisubService.cs | Tightens caps/TTLs and adds proactive timer-based cleanup plus tier broadcast optimizations. |
| src/execution/Basalt.Execution/VM/Sandbox/ResourceLimiter.cs | Adds SpinWait in CAS loop to reduce CPU burn under contention. |
| src/execution/Basalt.Execution/Mempool.cs | Adds empty fast paths for pending retrieval and pruning. |
| src/execution/Basalt.Execution/BlockBuilder.cs | Skips DEX work when no pools exist and reuses DEX state where possible. |
| src/consensus/Basalt.Consensus/Staking/StakingState.cs | Re-activates validators when stake recovers above minimum. |
| src/consensus/Basalt.Consensus/Staking/SlashingEngine.cs | Caps in-memory slashing history. |
| src/consensus/Basalt.Consensus/PipelinedConsensus.cs | Caches BLS public key and cleans up finalized rounds to release memory sooner. |
| src/consensus/Basalt.Consensus/BasaltBft.cs | Adds COMMIT bitmap update logic, caches BLS public key, and logs identity/key mismatches. |
| src/api/Basalt.Api.Rest/WebSocketEndpoint.cs | Skips per-block WS broadcast allocations when no clients are connected. |
```csharp
private static readonly List<TransactionReceipt> s_emptyReceipts = [];

public List<TransactionReceipt> ApplyDexSettlement(IStateDatabase stateDb, BlockHeader blockHeader)
{
    var allReceipts = new List<TransactionReceipt>();
    // Fast path: skip all DEX operations when no pools exist.
    // On an idle chain this avoids allocating DexState objects, reading pool state,
    // and running TWAP/order logic on every block.
    var poolCheck = new DexState(stateDb);
    if (poolCheck.GetPoolCount() == 0)
        return s_emptyReceipts;
```
Returning a shared mutable list (s_emptyReceipts) is risky for the same reason as the mempool empty-list fast path: if any caller mutates the returned List, it will affect all other callers. Consider returning a new empty List each time, or change the return type to IReadOnlyList&lt;TransactionReceipt&gt; and return Array.Empty&lt;TransactionReceipt&gt;() (or []) for the empty case.
```csharp
// Shared LRU block cache across all column families.
// Without this, RocksDB creates an unbounded per-CF cache whose native memory
// grows proportionally to on-disk data (SST index/filter blocks, data blocks).
var blockCache = RocksDbSharp.Native.Instance.rocksdb_cache_create_lru(new UIntPtr(64 * 1024 * 1024)); // 64 MB shared

var defaultTableOptions = new BlockBasedTableOptions()
    .SetBlockCache(blockCache)
    .SetCacheIndexAndFilterBlocks(true);

var defaultOptions = new ColumnFamilyOptions()
    .SetBlockBasedTableFactory(defaultTableOptions);
```
The LRU block cache is created via rocksdb_cache_create_lru but never destroyed in Dispose(). Since this is native memory, repeated RocksDbStore instances (tests, multiple nodes in-process, etc.) can leak until process exit. Consider storing the cache handle in a field and calling rocksdb_cache_destroy in Dispose(), or using a RocksDbSharp-managed cache wrapper if available.
```csharp
// Proactive TTL cleanup every 30 seconds — prevents unbounded cache growth
// even when message rate stays below the reactive cap threshold.
_cleanupTimer = new Timer(_ =>
{
    if (Interlocked.CompareExchange(ref _cleanupRunning, 1, 0) != 0)
        return;
    try { CleanupSeenMessages(); }
    catch (Exception ex)
    {
        // Swallow to prevent process termination from unhandled ThreadPool exception.
        // Timer will retry in 30 seconds.
        _logger.LogWarning(ex, "Error during Episub message cache cleanup");
    }
    finally { Interlocked.Exchange(ref _cleanupRunning, 0); }
}, null, TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(30));
```
_cleanupTimer is created but never disposed/stopped. Since NodeCoordinator.StopAsync doesn’t dispose EpisubService, this timer will keep the service (and its dictionaries/logger) alive and continue running after shutdown, which can cause background work after stop and resource leaks. Consider implementing IDisposable/IAsyncDisposable on EpisubService to dispose the timer, and have NodeCoordinator.StopAsync dispose EpisubService (or expose a Stop method).
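A minimal sketch of the suggested fix, with EpisubService internals assumed from the PR description and the coordinator hookup shown only as a comment:

```csharp
using System;
using System.Threading;

// Sketch: own the cleanup timer and dispose it on shutdown so no background
// callbacks (or rooted dictionaries/logger references) outlive the service.
public sealed class EpisubService : IDisposable
{
    private readonly Timer _cleanupTimer;

    public EpisubService()
    {
        _cleanupTimer = new Timer(
            _ => { /* CleanupSeenMessages() — elided */ },
            null, TimeSpan.FromSeconds(30), TimeSpan.FromSeconds(30));
    }

    public void Dispose() => _cleanupTimer.Dispose(); // stops further callbacks
}

// NodeCoordinator.StopAsync would then call _episubService.Dispose()
// (illustrative — the actual wiring depends on the coordinator's shutdown order).
```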
```csharp
// the commit bitmap. Different leaders may have different placeholder states, leading
// to divergent inactivity slashing calculations at epoch boundaries — a fatal
// consistency violation that permanently splits the network.
var expectedPeers = _config.Peers.Length;
```
expectedPeers is declared but never used, which should trigger a compiler warning and adds noise to the startup logic. Remove it, or use it in the identity-resolution logging/threshold if that was the intent.
```csharp
private static readonly List<Transaction> s_emptyTxList = [];

public List<Transaction> GetPending(int maxCount, IStateDatabase? stateDb = null)
{
    lock (_lock)
    {
        // Fast path: empty mempool — avoid all allocations
        if (_orderedEntries.Count == 0)
            return s_emptyTxList;
```
Returning a shared mutable List (s_emptyTxList) is unsafe: callers can modify it (Add/Clear/etc.) and that mutation will be observed by all future callers, potentially corrupting mempool behavior across threads. Prefer returning a new empty List each time, or switch the API to return IReadOnlyList&lt;Transaction&gt;/IReadOnlyCollection&lt;Transaction&gt; and return Array.Empty&lt;Transaction&gt;() (or []) for the empty case.
```csharp
// Fast path: skip when mempool is empty (common on idle chains).
// Avoids lock acquisition, list allocation, and trie reads.
if (_transactions.Count == 0 && _dexIntentTransactions.Count == 0)
    return 0;
```
This "fast path" reads _transactions.Count / _dexIntentTransactions.Count outside the _lock, but those dictionaries are mutated under the same lock elsewhere. Concurrent reads/writes on Dictionary are not thread-safe and can lead to data races or undefined behavior. Move the emptiness check inside the existing lock (or track counts via atomic fields) to keep the mempool’s synchronization consistent.
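A sketch of the suggested fix, with member names assumed from the review context and the original prune body stubbed out:

```csharp
using System.Collections.Generic;

// Sketch: take the lock BEFORE reading the counts, so Dictionary.Count is
// never read concurrently with writers. Uncontended lock acquisition is cheap,
// so the fast path keeps almost all of its benefit.
public sealed class Mempool
{
    private readonly object _lock = new();
    private readonly Dictionary<string, object> _transactions = new();
    private readonly Dictionary<string, object> _dexIntentTransactions = new();

    public int PruneStale()
    {
        lock (_lock)
        {
            // Emptiness check now synchronizes with all mutations.
            if (_transactions.Count == 0 && _dexIntentTransactions.Count == 0)
                return 0; // still cheap: no list allocation, no trie reads

            return PruneStaleLocked(); // hypothetical: the original prune body
        }
    }

    private int PruneStaleLocked() => 0; // stub so the sketch compiles
}
```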
…e state

- Remove shared mutable static empty lists in BlockBuilder and Mempool that could be corrupted if any caller mutated the returned List
- Store RocksDB LRU block cache handle and destroy in Dispose() to prevent native memory leak across store instances
- Make EpisubService IDisposable to stop cleanup timer on shutdown, dispose from NodeCoordinator.StopAsync
- Move Mempool.PruneStale emptiness check inside lock to fix data race on Dictionary.Count reads concurrent with mutations
- Remove unused expectedPeers variable in RunConsensusLoop
Summary
Fixes critical consensus, performance, and memory issues discovered through validator log analysis on a 4-node Docker devnet. The chain previously halted permanently after ~1.5 hours due to validator set divergence, and CPU saturated at 100% on idle chains.
Before: chain halts at ~3,300 blocks, 100% CPU, ~50 MB/hour RAM growth
After: 36+ hours stable, healthy CPU, bounded RAM
Root Causes & Fixes
Consensus Stability
CPU Saturation
- Block announces triggered `TrySyncFromPeers()`, each calling `Fork()` → `ComputeStateRoot()` (expensive Merkle trie rehashing). 12+ wasted state root computations per block.
- `BroadcastConsensusMessage` serialized for dedup hash, then `BroadcastToAll` serialized again for sending.
- `ApplyBatch` called `Fork()` even when all blocks were already finalized by consensus.
- `_dirtyStorageKeys` and `_dirtyAccounts` on the TrieStateDb layer grew unboundedly (the FlatStateDb cleanup never forwarded to the inner trie), causing escalating GC pressure → 100% CPU after hours.

Memory Growth
- RocksDB block cache was unbounded; added a shared 64 MB LRU cache with `CacheIndexAndFilterBlocks`.
- `_storageCache` grew monotonically with no eviction. Added 500K entry hard cap with full eviction.
- `ApplyBatch` never called `ClearDirtyTracking()`/`CompactDeletedSets()` on the forked state before swap.

Test plan
- `dotnet build` — 0 warnings, 0 errors
- `dotnet test` — all tests pass across 16 projects