fix(consensus): restore staking state on Pass-2 rollback#792
Conversation
apply_block_pass2 mutates stake_registry / epoch_manager / slashing via the centralized reward bundle before the #1e state_root check, but the C-03 rollback snapshot didn't capture them. Post STATE_IN_TRIE these feed the state_root, so a #1e reject left pending_rewards / epoch / liveness incremented — that leak then diverged the next block's computed root and crawled the chain once both forks were live. Snapshot + restore them so the Pass-2 rollback is atomic over every state_root input. Bump workspace 2.2.30 -> 2.2.31.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Too much diff to scan? Review this PR in Change Stack to start with the highest-impact changes. 📝 WalkthroughWalkthroughThis PR extends Block Pass 2 rollback recovery to include staking subsystem state. When block application fails partway through Pass 2 execution, the snapshot now captures and restores Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Possibly related PRs
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@crates/sentrix-core/src/block_executor.rs`:
- Around line 620-622: The non-rollbackable side effects in apply_block_pass2
(notably writes to TABLE_BLOOM and subscriber notifications like emit_new_head /
emit_finalized) must be deferred until after the state_root check that can still
reject the block; modify apply_block_pass2 to buffer TABLE_BLOOM updates and all
subscriber emits (or gate them behind a success flag) and perform those buffered
writes/emits only after the state_root verification passes, and similarly ensure
the rollback that restores stake_registry, epoch_manager, and slashing (and the
analogous code around the earlier restore at the other mentioned spot) remains
unchanged but occurs before any buffered side-effects are flushed so no phantom
data or notifications escape a failed state_root.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 7e91818e-7540-44d3-8d7f-857101e1ad7e
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock,!**/*.lock
📒 Files selected for processing (2)
Cargo.tomlcrates/sentrix-core/src/block_executor.rs
| stake_registry: self.stake_registry.clone(), | ||
| epoch_manager: self.epoch_manager.clone(), | ||
| slashing: self.slashing.clone(), |
There was a problem hiding this comment.
Rollback is still non-atomic for pre-#1e side effects.
Restoring stake_registry, epoch_manager, and slashing closes the consensus leak, but apply_block_pass2 still performs success-only side effects before the later state_root reject path can fire. In this same function, TABLE_BLOOM is written at Lines 1507-1517 and subscriber notifications are emitted throughout Pass 2, including emit_new_head / emit_finalized at Lines 1573-1585, while the #1e rejection still happens later at Lines 1632-1748. A block that fails the root check will now rewind staking state but can still leave phantom query data or notify clients about a block that never committed. Please defer those non-rollbackable writes/emits until after the state_root check, or buffer them behind the success path. As per coding guidelines, crates/sentrix-core/src/block_executor** is consensus-critical and state-apply must be deterministic.
Also applies to: 669-675
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@crates/sentrix-core/src/block_executor.rs` around lines 620 - 622, The
non-rollbackable side effects in apply_block_pass2 (notably writes to
TABLE_BLOOM and subscriber notifications like emit_new_head / emit_finalized)
must be deferred until after the state_root check that can still reject the
block; modify apply_block_pass2 to buffer TABLE_BLOOM updates and all subscriber
emits (or gate them behind a success flag) and perform those buffered
writes/emits only after the state_root verification passes, and similarly ensure
the rollback that restores stake_registry, epoch_manager, and slashing (and the
analogous code around the earlier restore at the other mentioned spot) remains
unchanged but occurs before any buffered side-effects are flushed so no phantom
data or notifications escape a failed state_root.
…ciliation (post-#792) (#795) * test(consensus): regression for Pass-2 rollback restoring staking state Pads past STATE_ROOT_FORK_HEIGHT, runs the reward bundle, then forces a #1e via a tampered state_root and asserts pending_rewards rolls back. Fails before the snapshot fix (leak), passes after. * fix(consensus): apply finalized peer block from local stash On FinalizeBlock for a peer-proposed block the node already held the block (it had voted for it; hash matches the action's block_hash, justification carries the 2/3 precommit certificate) but refused to apply it unless it was the local proposer — instead triggering libp2p sync and breaking. When NewBlock gossip missed, the node sat in Finalize re-requesting a block it already had, stalling/crawling the chain. Apply it locally like the self-propose arm; validate_block still re-checks structure + justification supermajority before the write. No state_root/format change. * fix(consensus): quiet expected #1e on finalize apply-from-stash The apply-from-stash path commits a peer-proposed finalized block via the SelfProduced source; the stashed proposal carries the proposer's pre-apply state_root, which never matches the freshly computed post-apply root, so it trips the #1e CHECK every block (it still self-heals via the libp2p receive path, which commits the canonical block). That flooded the logs ~1/block and falsely drove the DivergenceTracker alarm. Keep the LOUD alarm + tracker only for a Peer-source mismatch (a real cross-node divergence); log the self-apply case at debug and skip the tracker. Behavior unchanged (still returns Err → receive-path commit); only logging/metrics change. * fix(storage): poll-driven batched block persistence to close sync gaps The save-writer only persisted block:{N} keys when a finalized height was pushed onto save_tx, but that push lives in the commit path that is skipped whenever add_block returns Err on the BFT apply-from-stash state_root recompute mismatch. The block is still canonical (2/3 precommit justification) and the chain advances, so most heights never got a block:{N} key — they aged out of the in-memory window into permanent storage gaps, stranding observer/fullnode GetBlocks sync on the missing height. Make the writer poll-driven: every 5s persist the newly-committed block range by chain membership (not by the apply result), via a new batched save_blocks — one MDBX write txn / one fsync for the whole range. A first attempt used per-block save_block, whose per-block full-env mdbx.sync() contended with the apply path's trie write txns and stalled consensus (3/s -> 0.1/s); batching collapses that to one fsync per tick. The full state blob (save_blockchain) now runs on a 60s cadence purely to bound load-time B2 replay, which already rebuilds accounts from the block:{N} keys; the graceful-shutdown path still writes the blob on clean exit. Verified on testnet: block rate holds ~2.9/s (no stall) and post-fix blocks stay servable below the in-memory window (0 gaps). * fix(storage): stop B3b overwriting total_minted with closed-form on load load_blockchain's B3b reconcile overwrote total_minted with a closed-form sum (TOTAL_PREMINE + Σ flat BLOCK_REWARD>>halvings per height). That assumes every block's coinbase.amount equals the flat reward, but blocks with a reduced/zero coinbase make the true minted — the sum of the proposer-stamped coinbase.amount, which block_executor.rs:795 accumulates live and which all running validators agree on — strictly LESS than the closed form (~3000 SRX at testnet h≈6.27M). total_minted feeds state_root, so the overwrite forced a divergent total_minted into the root on every load: an observer GetBlocks-#1e rejected every block, and a validator restart would have forked. The blob holds the canonical live value and B2 replay re-applies coinbase for the post-checkpoint range, so the value is already correct after load — the closed-form overwrite was redundant and wrong. Keep the comparison as an advisory warn; do not mutate. Verified: a node reloaded with this change reconciles total_minted to the live/consensus value instead of the inflated closed form. * feat(consensus): add one-time treasury rebase reconciliation (fork-gated) PROTOCOL_TREASURY drifted across validators (val2/val3 over-credited ~1-2 SRX) during the multipath-distribute_reward era (STATE_ROOT_V2_HEIGHT 2689134 → REWARD_APPLY_PATH_HEIGHT 6239300). Credit is single-path/deterministic since 6239300 so the drift is frozen, but because treasury is committed in the state_root trie (since 2689134) each node computes a divergent local state_root — observers #1e-reject every block (validators tolerate via apply-from-stash). Treasury is the SOLE divergent account (all others byte-identical fleet-wide). Add a SEPARATE one-time force-set in update_trie_for_block gated on new env TREASURY_REBASE_HEIGHT + TREASURY_REBASE_BALANCE (default u64::MAX = dormant, ships safe). It must NOT reuse STATE_ROOT_V2_HEIGHT: that var also drives the trie-INCLUSION cutoff (block.index >= STATE_ROOT_V2_HEIGHT), so moving it would retroactively drop treasury from the trie for the historical range and fork. At the activation height every node sets the same operator-set canonical → converge; deterministic so B2 replay re-applies it. Activation runbook: pick canonical (supply-consistent majority or history-recompute), halt-all, set the two env vars, simul-start, verify treasury agreement at the activation block. * feat(node): observer-tolerant state_root accept for fullnodes (gated) An observer/fullnode applies every block via add_block_from_peer (Peer); the Peer branch of the #1e state_root check rejected any block whose proposer-stamped root differed from the local recompute, which on a chain whose state-commitment is imperfect (recurring/oscillating state_root that validators already tolerate via apply-from-stash, since consensus is on block_hash not state_root) means the observer rejects EVERY block and can never sync — even from a clean cp with byte-identical blocks and a canonical trie root_at_version. Gate a tolerant path behind SENTRIX_OBSERVER_TOLERANT_STATE_ROOT=1 (default OFF, validators unchanged): for a Peer block — which has already passed the strict 2/3 precommit justification verification earlier in add_block_impl, so it IS the network-agreed canonical block — accept on #1e by stamping the received (proposer's, canonical) root and returning Ok instead of Err, logging the local divergence at debug. The observer's local accounts already diverge from that root (the same pre-existing commitment imperfection every node has), so served state is no worse than a validator's. Rejecting a 2/3-justified block is halting on canonical data; accepting+stamping keeps the observer's chain consistent with the committed roots so it can sync and serve RPC. Validated on the testnet fullnode: ok=261 err=0 (was ok=0 err≈20), CRITICAL #1e=0, catching up at ~10 blk/s. Does not weaken the justification verify; validators keep strict #1e (env off). * test(storage): update B3b test for advisory no-overwrite behavior fix-B made B3b advisory (no longer overwrites total_minted with the closed-form, which over-counts on reduced-coinbase chains and feeds a divergent state_root). The old test asserted the overwrite; flip it to assert the persisted blob value survives load untouched. Sole failing test in CI (272 passed, 1 failed); all 12 storage tests green locally.
Problem
Validators on the internal testnet computed different
state_roots for thesame block+parent after both the state-in-trie and reward-in-apply forks
activated, producing a stream of
#1estate_root mismatches and reducingblock production to a crawl. Committed history stayed consistent, but the
chain could not sustain finalization.
Root cause
apply_block_pass2runs the centralized reward bundle(
apply_reward_bookkeeping_for_latest_block) before the#1estate_root check. That bundle mutates
stake_registry(pending_rewards),epoch_manager, andslashing(liveness).The C-03 Pass-2 rollback snapshot (
BlockchainSnapshot) capturedaccounts / contracts / nft_registry / authority / mempool / total_minted / chain_len / trie_root— but not those three. OnceSTATE_IN_TRIEactivated,
pending_rewards / epoch / livenessare committed into thestate_root. So every rejected (#1e) block left them incremented; theleaked values then diverged the next block's computed root, compounding
per-node. This is the interaction of the state-in-trie fork, the
reward-in-apply fork, and the older (pre-both) rollback snapshot.
Fix
Snapshot and restore
stake_registry,epoch_manager, andslashinginthe Pass-2 rollback path, so the rollback is atomic over every input
the
state_rootnow commits. Success path is unchanged; only the rejectpath is made complete. No change to the committed-block root formula, so no
fork gate is required — deploy fleet-wide.
Verified every
state_rootinput inupdate_trie_for_blockis now covered:accounts, per-validator pending_rewards, liveness, epoch_state, total_minted,
SRC-20 / NFT registry hashes.
Validation
Deployed to the 4-validator internal testnet:
#1emismatches dropped from~1/s to 0 across all validators, and committed
state_rootis identicalacross all nodes at every height.
Follow-up (this PR)
Regression test: a
#1ereject after the reward bundle runs, assertingpending_rewards / epoch / livenessare restored to their pre-block values.Summary by CodeRabbit
Chores
Bug Fixes