fix(consensus): restore staking state on Pass-2 rollback by satyakwok · Pull Request #792 · sentrix-labs/sentrix

satyakwok · 2026-06-05T19:33:00Z

Problem

Validators on the internal testnet computed different state_roots for the
same block+parent after both the state-in-trie and reward-in-apply forks
activated, producing a stream of #1e state_root mismatches and reducing
block production to a crawl. Committed history stayed consistent, but the
chain could not sustain finalization.

Root cause

apply_block_pass2 runs the centralized reward bundle
(apply_reward_bookkeeping_for_latest_block) before the #1e
state_root check. That bundle mutates stake_registry (pending_rewards),
epoch_manager, and slashing (liveness).

The C-03 Pass-2 rollback snapshot (BlockchainSnapshot) captured
accounts / contracts / nft_registry / authority / mempool / total_minted / chain_len / trie_root — but not those three. Once STATE_IN_TRIE
activated, pending_rewards / epoch / liveness are committed into the
state_root. So every rejected (#1e) block left them incremented; the
leaked values then diverged the next block's computed root, compounding
per-node. This is the interaction of the state-in-trie fork, the
reward-in-apply fork, and the older (pre-both) rollback snapshot.

Fix

Snapshot and restore stake_registry, epoch_manager, and slashing in
the Pass-2 rollback path, so the rollback is atomic over every input
the state_root now commits. Success path is unchanged; only the reject
path is made complete. No change to the committed-block root formula, so no
fork gate is required — deploy fleet-wide.

Verified every state_root input in update_trie_for_block is now covered:
accounts, per-validator pending_rewards, liveness, epoch_state, total_minted,
SRC-20 / NFT registry hashes.

Validation

Deployed to the 4-validator internal testnet: #1e mismatches dropped from
~1/s to 0 across all validators, and committed state_root is identical
across all nodes at every height.

Follow-up (this PR)

Regression test: a #1e reject after the reward bundle runs, asserting
pending_rewards / epoch / liveness are restored to their pre-block values.

Summary by CodeRabbit

Chores
- Bumped version to 2.2.31
Bug Fixes
- Enhanced state rollback mechanism to prevent staking state mutations (rewards, epochs, liveness, slashing) from persisting when block application fails, ensuring proper cleanup of failed blocks.

apply_block_pass2 mutates stake_registry / epoch_manager / slashing via the centralized reward bundle before the #1e state_root check, but the C-03 rollback snapshot didn't capture them. Post STATE_IN_TRIE these feed the state_root, so a #1e reject left pending_rewards / epoch / liveness incremented — that leak then diverged the next block's computed root and crawled the chain once both forks were live. Snapshot + restore them so the Pass-2 rollback is atomic over every state_root input. Bump workspace 2.2.30 -> 2.2.31.

codecov · 2026-06-05T19:34:50Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

coderabbitai · 2026-06-05T19:38:14Z

Too much diff to scan? Review this PR in Change Stack to start with the highest-impact changes.

📝 Walkthrough

Walkthrough

This PR extends Block Pass 2 rollback recovery to include staking subsystem state. When block application fails partway through Pass 2 execution, the snapshot now captures and restores stake_registry, epoch_manager, and slashing in addition to existing account/contract/authority state. This prevents pending staking mutations—such as reward accumulation, epoch transitions, and liveness/slashing updates—from leaking into subsequent blocks after a failed block commit.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Bug A: off-trie consensus state drift (pending_rewards, total_minted, epoch_state) — silent fork-trigger on any consuming tx #750: Directly addresses off-trie staking-state drift by ensuring consensus/staking mutations are restored together on Pass 2 rollback, preventing inconsistent state leaks into subsequent blocks.

Possibly related PRs

sentrix-labs/sentrix#782: Both PRs work with staking state mutations in apply_block_pass2 (stake_registry, epoch_manager, slashing)—this PR ensures failed mutations are rolled back via snapshot restoration, while PR #782 gates reward/epoch bookkeeping operations.
sentrix-labs/sentrix#763: Both PRs strengthen staking/epoch/slashing consistency during block processing—this PR extends Pass 2 rollback to restore staking state, while PR #763 wires epoch bookkeeping (liveness/slashing) for libp2p sync paths.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The description is detailed and well-structured, covering problem, root cause, fix, and validation. However, it does not follow the repository's required template with sections like Scope, Checks, Linked issue, and Deploy impact.	Reformat the description to follow the repository template: add Scope checkboxes, Checks checklist, Linked issue reference, and Deploy impact section.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and concisely describes the main fix: restoring staking state during Pass-2 rollback to prevent consensus divergence.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/pass2-rollback-restore-staking-state

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/sentrix-core/src/block_executor.rs`:
- Around line 620-622: The non-rollbackable side effects in apply_block_pass2
(notably writes to TABLE_BLOOM and subscriber notifications like emit_new_head /
emit_finalized) must be deferred until after the state_root check that can still
reject the block; modify apply_block_pass2 to buffer TABLE_BLOOM updates and all
subscriber emits (or gate them behind a success flag) and perform those buffered
writes/emits only after the state_root verification passes, and similarly ensure
the rollback that restores stake_registry, epoch_manager, and slashing (and the
analogous code around the earlier restore at the other mentioned spot) remains
unchanged but occurs before any buffered side-effects are flushed so no phantom
data or notifications escape a failed state_root.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 7e91818e-7540-44d3-8d7f-857101e1ad7e

📥 Commits

Reviewing files that changed from the base of the PR and between a132baf and b9903c1.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock, !**/*.lock

📒 Files selected for processing (2)

Cargo.toml
crates/sentrix-core/src/block_executor.rs

coderabbitai · 2026-06-05T19:38:17Z

+            stake_registry: self.stake_registry.clone(),
+            epoch_manager: self.epoch_manager.clone(),
+            slashing: self.slashing.clone(),


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Rollback is still non-atomic for pre-#1e side effects.

Restoring stake_registry, epoch_manager, and slashing closes the consensus leak, but apply_block_pass2 still performs success-only side effects before the later state_root reject path can fire. In this same function, TABLE_BLOOM is written at Lines 1507-1517 and subscriber notifications are emitted throughout Pass 2, including emit_new_head / emit_finalized at Lines 1573-1585, while the #1e rejection still happens later at Lines 1632-1748. A block that fails the root check will now rewind staking state but can still leave phantom query data or notify clients about a block that never committed. Please defer those non-rollbackable writes/emits until after the state_root check, or buffer them behind the success path. As per coding guidelines, crates/sentrix-core/src/block_executor** is consensus-critical and state-apply must be deterministic.

Also applies to: 669-675

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/sentrix-core/src/block_executor.rs` around lines 620 - 622, The non-rollbackable side effects in apply_block_pass2 (notably writes to TABLE_BLOOM and subscriber notifications like emit_new_head / emit_finalized) must be deferred until after the state_root check that can still reject the block; modify apply_block_pass2 to buffer TABLE_BLOOM updates and all subscriber emits (or gate them behind a success flag) and perform those buffered writes/emits only after the state_root verification passes, and similarly ensure the rollback that restores stake_registry, epoch_manager, and slashing (and the analogous code around the earlier restore at the other mentioned spot) remains unchanged but occurs before any buffered side-effects are flushed so no phantom data or notifications escape a failed state_root.

…ciliation (post-#792) (#795) * test(consensus): regression for Pass-2 rollback restoring staking state Pads past STATE_ROOT_FORK_HEIGHT, runs the reward bundle, then forces a #1e via a tampered state_root and asserts pending_rewards rolls back. Fails before the snapshot fix (leak), passes after. * fix(consensus): apply finalized peer block from local stash On FinalizeBlock for a peer-proposed block the node already held the block (it had voted for it; hash matches the action's block_hash, justification carries the 2/3 precommit certificate) but refused to apply it unless it was the local proposer — instead triggering libp2p sync and breaking. When NewBlock gossip missed, the node sat in Finalize re-requesting a block it already had, stalling/crawling the chain. Apply it locally like the self-propose arm; validate_block still re-checks structure + justification supermajority before the write. No state_root/format change. * fix(consensus): quiet expected #1e on finalize apply-from-stash The apply-from-stash path commits a peer-proposed finalized block via the SelfProduced source; the stashed proposal carries the proposer's pre-apply state_root, which never matches the freshly computed post-apply root, so it trips the #1e CHECK every block (it still self-heals via the libp2p receive path, which commits the canonical block). That flooded the logs ~1/block and falsely drove the DivergenceTracker alarm. Keep the LOUD alarm + tracker only for a Peer-source mismatch (a real cross-node divergence); log the self-apply case at debug and skip the tracker. Behavior unchanged (still returns Err → receive-path commit); only logging/metrics change. * fix(storage): poll-driven batched block persistence to close sync gaps The save-writer only persisted block:{N} keys when a finalized height was pushed onto save_tx, but that push lives in the commit path that is skipped whenever add_block returns Err on the BFT apply-from-stash state_root recompute mismatch. The block is still canonical (2/3 precommit justification) and the chain advances, so most heights never got a block:{N} key — they aged out of the in-memory window into permanent storage gaps, stranding observer/fullnode GetBlocks sync on the missing height. Make the writer poll-driven: every 5s persist the newly-committed block range by chain membership (not by the apply result), via a new batched save_blocks — one MDBX write txn / one fsync for the whole range. A first attempt used per-block save_block, whose per-block full-env mdbx.sync() contended with the apply path's trie write txns and stalled consensus (3/s -> 0.1/s); batching collapses that to one fsync per tick. The full state blob (save_blockchain) now runs on a 60s cadence purely to bound load-time B2 replay, which already rebuilds accounts from the block:{N} keys; the graceful-shutdown path still writes the blob on clean exit. Verified on testnet: block rate holds ~2.9/s (no stall) and post-fix blocks stay servable below the in-memory window (0 gaps). * fix(storage): stop B3b overwriting total_minted with closed-form on load load_blockchain's B3b reconcile overwrote total_minted with a closed-form sum (TOTAL_PREMINE + Σ flat BLOCK_REWARD>>halvings per height). That assumes every block's coinbase.amount equals the flat reward, but blocks with a reduced/zero coinbase make the true minted — the sum of the proposer-stamped coinbase.amount, which block_executor.rs:795 accumulates live and which all running validators agree on — strictly LESS than the closed form (~3000 SRX at testnet h≈6.27M). total_minted feeds state_root, so the overwrite forced a divergent total_minted into the root on every load: an observer GetBlocks-#1e rejected every block, and a validator restart would have forked. The blob holds the canonical live value and B2 replay re-applies coinbase for the post-checkpoint range, so the value is already correct after load — the closed-form overwrite was redundant and wrong. Keep the comparison as an advisory warn; do not mutate. Verified: a node reloaded with this change reconciles total_minted to the live/consensus value instead of the inflated closed form. * feat(consensus): add one-time treasury rebase reconciliation (fork-gated) PROTOCOL_TREASURY drifted across validators (val2/val3 over-credited ~1-2 SRX) during the multipath-distribute_reward era (STATE_ROOT_V2_HEIGHT 2689134 → REWARD_APPLY_PATH_HEIGHT 6239300). Credit is single-path/deterministic since 6239300 so the drift is frozen, but because treasury is committed in the state_root trie (since 2689134) each node computes a divergent local state_root — observers #1e-reject every block (validators tolerate via apply-from-stash). Treasury is the SOLE divergent account (all others byte-identical fleet-wide). Add a SEPARATE one-time force-set in update_trie_for_block gated on new env TREASURY_REBASE_HEIGHT + TREASURY_REBASE_BALANCE (default u64::MAX = dormant, ships safe). It must NOT reuse STATE_ROOT_V2_HEIGHT: that var also drives the trie-INCLUSION cutoff (block.index >= STATE_ROOT_V2_HEIGHT), so moving it would retroactively drop treasury from the trie for the historical range and fork. At the activation height every node sets the same operator-set canonical → converge; deterministic so B2 replay re-applies it. Activation runbook: pick canonical (supply-consistent majority or history-recompute), halt-all, set the two env vars, simul-start, verify treasury agreement at the activation block. * feat(node): observer-tolerant state_root accept for fullnodes (gated) An observer/fullnode applies every block via add_block_from_peer (Peer); the Peer branch of the #1e state_root check rejected any block whose proposer-stamped root differed from the local recompute, which on a chain whose state-commitment is imperfect (recurring/oscillating state_root that validators already tolerate via apply-from-stash, since consensus is on block_hash not state_root) means the observer rejects EVERY block and can never sync — even from a clean cp with byte-identical blocks and a canonical trie root_at_version. Gate a tolerant path behind SENTRIX_OBSERVER_TOLERANT_STATE_ROOT=1 (default OFF, validators unchanged): for a Peer block — which has already passed the strict 2/3 precommit justification verification earlier in add_block_impl, so it IS the network-agreed canonical block — accept on #1e by stamping the received (proposer's, canonical) root and returning Ok instead of Err, logging the local divergence at debug. The observer's local accounts already diverge from that root (the same pre-existing commitment imperfection every node has), so served state is no worse than a validator's. Rejecting a 2/3-justified block is halting on canonical data; accepting+stamping keeps the observer's chain consistent with the committed roots so it can sync and serve RPC. Validated on the testnet fullnode: ok=261 err=0 (was ok=0 err≈20), CRITICAL #1e=0, catching up at ~10 blk/s. Does not weaken the justification verify; validators keep strict #1e (env off). * test(storage): update B3b test for advisory no-overwrite behavior fix-B made B3b advisory (no longer overwrites total_minted with the closed-form, which over-counts on reduced-coinbase chains and feeds a divergent state_root). The old test asserted the overwrite; flip it to assert the persisted blob value survives load untouched. Sole failing test in CI (272 passed, 1 failed); all 12 storage tests green locally.

github-actions Bot enabled auto-merge (squash) June 5, 2026 19:33

coderabbitai Bot reviewed Jun 5, 2026

View reviewed changes

github-actions Bot merged commit 2e08be8 into main Jun 5, 2026
19 checks passed

coderabbitai Bot mentioned this pull request Jun 6, 2026

fix(consensus): accept justified peer block on #1e by default (no env) #801

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(consensus): restore staking state on Pass-2 rollback#792

fix(consensus): restore staking state on Pass-2 rollback#792
github-actions[bot] merged 1 commit into
mainfrom
fix/pass2-rollback-restore-staking-state

satyakwok commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

codecov Bot commented Jun 5, 2026

Uh oh!

coderabbitai Bot commented Jun 5, 2026

Walkthrough

Estimated code review effort

Possibly related issues

Possibly related PRs

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 5, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

satyakwok commented Jun 5, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix

Validation

Follow-up (this PR)

Summary by CodeRabbit

Uh oh!

codecov Bot commented Jun 5, 2026

Codecov Report

Uh oh!

coderabbitai Bot commented Jun 5, 2026

Walkthrough

Estimated code review effort

Possibly related issues

Possibly related PRs

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

satyakwok commented Jun 5, 2026 •

edited by coderabbitai Bot

Loading