
fix(consensus): restore staking state on Pass-2 rollback #792

Merged
github-actions[bot] merged 1 commit into main from fix/pass2-rollback-restore-staking-state
Jun 5, 2026


Member

@satyakwok satyakwok commented Jun 5, 2026

Problem

Validators on the internal testnet computed different state_roots for the
same block+parent after both the state-in-trie and reward-in-apply forks
activated, producing a stream of #1e state_root mismatches and reducing
block production to a crawl. Committed history stayed consistent, but the
chain could not sustain finalization.

Root cause

apply_block_pass2 runs the centralized reward bundle
(apply_reward_bookkeeping_for_latest_block) before the #1e
state_root check. That bundle mutates stake_registry (pending_rewards),
epoch_manager, and slashing (liveness).

The C-03 Pass-2 rollback snapshot (BlockchainSnapshot) captured
accounts / contracts / nft_registry / authority / mempool / total_minted / chain_len / trie_root — but not those three. Once STATE_IN_TRIE
activated, pending_rewards / epoch / liveness are committed into the
state_root. So every rejected (#1e) block left them incremented; the
leaked values then diverged the next block's computed root, compounding
per-node. This is the interaction of the state-in-trie fork, the
reward-in-apply fork, and the older (pre-both) rollback snapshot.

Fix

Snapshot and restore stake_registry, epoch_manager, and slashing in
the Pass-2 rollback path, so the rollback is atomic over every input
the state_root now commits. Success path is unchanged; only the reject
path is made complete. No change to the committed-block root formula, so no
fork gate is required — deploy fleet-wide.

Verified every state_root input in update_trie_for_block is now covered:
accounts, per-validator pending_rewards, liveness, epoch_state, total_minted,
SRC-20 / NFT registry hashes.
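
A minimal sketch of the snapshot-and-restore shape described above, under stated assumptions: the struct layouts, the `execute` helper, and the hash-based `state_root` are toy stand-ins invented for illustration, not the code in `block_executor.rs`. Only the names `BlockchainSnapshot`, `stake_registry`, `epoch_manager`, `slashing`, and `apply_block_pass2` come from this PR.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

// Toy stand-ins for the real subsystems. Field layouts are assumptions.
#[derive(Clone, Debug, Default, PartialEq, Hash)]
pub struct StakeRegistry {
    pub pending_rewards: BTreeMap<String, u64>,
}
#[derive(Clone, Debug, Default, PartialEq, Hash)]
pub struct EpochManager {
    pub blocks_in_epoch: u64,
}
#[derive(Clone, Debug, Default, PartialEq, Hash)]
pub struct Slashing {
    pub liveness: BTreeMap<String, u64>,
}

#[derive(Clone, Debug, Default, PartialEq, Hash)]
pub struct Chain {
    pub accounts: BTreeMap<String, u64>,
    pub total_minted: u64,
    pub stake_registry: StakeRegistry,
    pub epoch_manager: EpochManager,
    pub slashing: Slashing,
}

/// Rollback snapshot. The last three fields are the ones this fix adds.
pub struct BlockchainSnapshot {
    accounts: BTreeMap<String, u64>,
    total_minted: u64,
    stake_registry: StakeRegistry,
    epoch_manager: EpochManager,
    slashing: Slashing,
}

impl Chain {
    /// Toy commitment over every field (stands in for the trie root).
    pub fn state_root(&self) -> u64 {
        let mut hasher = DefaultHasher::new();
        self.hash(&mut hasher);
        hasher.finish()
    }

    fn snapshot(&self) -> BlockchainSnapshot {
        BlockchainSnapshot {
            accounts: self.accounts.clone(),
            total_minted: self.total_minted,
            stake_registry: self.stake_registry.clone(),
            epoch_manager: self.epoch_manager.clone(),
            slashing: self.slashing.clone(),
        }
    }

    fn restore(&mut self, snap: BlockchainSnapshot) {
        self.accounts = snap.accounts;
        self.total_minted = snap.total_minted;
        self.stake_registry = snap.stake_registry;
        self.epoch_manager = snap.epoch_manager;
        self.slashing = snap.slashing;
    }

    /// Coinbase credit plus the reward bundle; both run before the root check.
    pub fn execute(&mut self, proposer: &str, reward: u64) {
        *self.accounts.entry(proposer.to_string()).or_insert(0) += reward;
        self.total_minted += reward;
        *self.stake_registry.pending_rewards.entry(proposer.to_string()).or_insert(0) += reward;
        self.epoch_manager.blocks_in_epoch += 1;
        *self.slashing.liveness.entry(proposer.to_string()).or_insert(0) += 1;
    }

    /// Applies a block; on a root mismatch every committed input is rewound.
    pub fn apply_block_pass2(&mut self, proposer: &str, reward: u64, claimed_root: u64) -> Result<(), String> {
        let snap = self.snapshot();
        self.execute(proposer, reward);
        let computed = self.state_root();
        if computed != claimed_root {
            self.restore(snap);
            return Err(format!("#1e state_root mismatch: {computed:x} != {claimed_root:x}"));
        }
        Ok(())
    }
}
```

In this toy model a rejected block leaves `Chain` equal to its pre-block value, which is the property the fix is after: the rollback covers exactly the set of fields the root commits.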

Validation

Deployed to the 4-validator internal testnet: #1e mismatches dropped from
~1/s to 0 across all validators, and committed state_root is identical
across all nodes at every height.

Follow-up (this PR)

Regression test: a #1e reject after the reward bundle runs, asserting
pending_rewards / epoch / liveness are restored to their pre-block values.

Summary by CodeRabbit

  • Chores

    • Bumped version to 2.2.31
  • Bug Fixes

    • Enhanced state rollback mechanism to prevent staking state mutations (rewards, epochs, liveness, slashing) from persisting when block application fails, ensuring proper cleanup of failed blocks.

apply_block_pass2 mutates stake_registry / epoch_manager / slashing via
the centralized reward bundle before the #1e state_root check, but the
C-03 rollback snapshot didn't capture them. Post STATE_IN_TRIE these feed
the state_root, so a #1e reject left pending_rewards / epoch / liveness
incremented. That leak then diverged the next block's computed root and
slowed the chain to a crawl once both forks were live. Snapshot + restore them so
the Pass-2 rollback is atomic over every state_root input.

Bump workspace 2.2.30 -> 2.2.31.
@github-actions github-actions Bot enabled auto-merge (squash) June 5, 2026 19:33

codecov Bot commented Jun 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.



coderabbitai Bot commented Jun 5, 2026

Walkthrough

This PR extends Block Pass 2 rollback recovery to include staking subsystem state. When block application fails partway through Pass 2 execution, the snapshot now captures and restores stake_registry, epoch_manager, and slashing in addition to existing account/contract/authority state. This prevents pending staking mutations—such as reward accumulation, epoch transitions, and liveness/slashing updates—from leaking into subsequent blocks after a failed block commit.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • sentrix-labs/sentrix#782: Both PRs work with staking state mutations in apply_block_pass2 (stake_registry, epoch_manager, slashing)—this PR ensures failed mutations are rolled back via snapshot restoration, while PR #782 gates reward/epoch bookkeeping operations.

  • sentrix-labs/sentrix#763: Both PRs strengthen staking/epoch/slashing consistency during block processing—this PR extends Pass 2 rollback to restore staking state, while PR #763 wires epoch bookkeeping (liveness/slashing) for libp2p sync paths.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Description check | ⚠️ Warning | The description is detailed and well-structured, covering problem, root cause, fix, and validation. However, it does not follow the repository's required template with sections like Scope, Checks, Linked issue, and Deploy impact. | Reformat the description to follow the repository template: add Scope checkboxes, Checks checklist, Linked issue reference, and Deploy impact section. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title accurately and concisely describes the main fix: restoring staking state during Pass-2 rollback to prevent consensus divergence. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/sentrix-core/src/block_executor.rs`:
- Around line 620-622: The non-rollbackable side effects in apply_block_pass2
(notably writes to TABLE_BLOOM and subscriber notifications like emit_new_head /
emit_finalized) must be deferred until after the state_root check that can still
reject the block; modify apply_block_pass2 to buffer TABLE_BLOOM updates and all
subscriber emits (or gate them behind a success flag) and perform those buffered
writes/emits only after the state_root verification passes, and similarly ensure
the rollback that restores stake_registry, epoch_manager, and slashing (and the
analogous code around the earlier restore at the other mentioned spot) remains
unchanged but occurs before any buffered side-effects are flushed so no phantom
data or notifications escape a failed state_root.


📥 Commits

Reviewing files that changed from the base of the PR and between a132baf and b9903c1.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • Cargo.toml
  • crates/sentrix-core/src/block_executor.rs

Comment on lines +620 to +622
stake_registry: self.stake_registry.clone(),
epoch_manager: self.epoch_manager.clone(),
slashing: self.slashing.clone(),

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Rollback is still non-atomic for pre-#1e side effects.

Restoring stake_registry, epoch_manager, and slashing closes the consensus leak, but apply_block_pass2 still performs success-only side effects before the later state_root reject path can fire. In this same function, TABLE_BLOOM is written at Lines 1507-1517 and subscriber notifications are emitted throughout Pass 2, including emit_new_head / emit_finalized at Lines 1573-1585, while the #1e rejection still happens later at Lines 1632-1748. A block that fails the root check will now rewind staking state but can still leave phantom query data or notify clients about a block that never committed. Please defer those non-rollbackable writes/emits until after the state_root check, or buffer them behind the success path. As per coding guidelines, crates/sentrix-core/src/block_executor** is consensus-critical and state-apply must be deterministic.

Also applies to: 669-675
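
One way the deferral suggested above could look, as a hedged sketch only: `SideEffect`, `EffectBuffer`, and this `apply_block` signature are hypothetical names made up for illustration, and the real `TABLE_BLOOM` write and `emit_new_head` / `emit_finalized` calls are reduced to enum variants. Effects are queued during Pass 2 and reach the outside world only after the root check passes.

```rust
/// Hypothetical record of a non-rollbackable effect.
#[derive(Clone, Debug, PartialEq)]
pub enum SideEffect {
    BloomWrite { height: u64 },
    NewHead { height: u64 },
    Finalized { height: u64 },
}

/// Collects effects during Pass 2; nothing escapes until `flush` is called.
#[derive(Default)]
pub struct EffectBuffer {
    pending: Vec<SideEffect>,
}

impl EffectBuffer {
    pub fn push(&mut self, effect: SideEffect) {
        self.pending.push(effect);
    }

    /// Hands the queued effects to the sink in the order they were queued.
    pub fn flush(self, sink: &mut Vec<SideEffect>) {
        sink.extend(self.pending);
    }
}

/// `sink` stands in for the database and the subscriber channels.
pub fn apply_block(
    height: u64,
    computed_root: u64,
    claimed_root: u64,
    sink: &mut Vec<SideEffect>,
) -> Result<(), String> {
    let mut effects = EffectBuffer::default();
    effects.push(SideEffect::BloomWrite { height });
    effects.push(SideEffect::NewHead { height });
    effects.push(SideEffect::Finalized { height });

    if computed_root != claimed_root {
        // The buffer is dropped here, so a rejected block leaves no trace.
        return Err("#1e state_root mismatch".to_string());
    }

    effects.flush(sink);
    Ok(())
}
```

The state rollback would run on the reject branch before the early return, so the ordering stays: restore state, drop buffered effects, report the error.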


@github-actions github-actions Bot merged commit 2e08be8 into main Jun 5, 2026
19 checks passed
github-actions Bot pushed a commit that referenced this pull request Jun 6, 2026
…ciliation (post-#792) (#795)

* test(consensus): regression for Pass-2 rollback restoring staking state

Pads past STATE_ROOT_FORK_HEIGHT, runs the reward bundle, then forces a
#1e via a tampered state_root and asserts pending_rewards rolls back.
Fails before the snapshot fix (leak), passes after.
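
The shape of that regression can be sketched in a toy model. Everything here is an assumption made for illustration: the `Node` struct, the arithmetic `state_root`, the value chosen for `STATE_ROOT_FORK_HEIGHT`, the idea that the root check is only enforced from that height on, and the `restore_staking` flag that switches between pre-fix and post-fix rollback.

```rust
/// Hypothetical activation height; the real value is not given here.
pub const STATE_ROOT_FORK_HEIGHT: u64 = 3;

#[derive(Clone, Debug, Default, PartialEq)]
pub struct Node {
    pub height: u64,
    pub balance: u64,
    pub pending_rewards: u64,
}

impl Node {
    /// Toy commitment over every field; stands in for the trie root.
    pub fn state_root(&self) -> u64 {
        self.height
            .wrapping_mul(1_000_003)
            .wrapping_add(self.balance.wrapping_mul(31))
            .wrapping_add(self.pending_rewards)
    }

    fn execute(&mut self, reward: u64) {
        self.height += 1;
        self.balance += reward;
        self.pending_rewards += reward; // the reward bundle
    }

    /// Root the block would commit if applied to the current state.
    pub fn expected_root(&self, reward: u64) -> u64 {
        let mut probe = self.clone();
        probe.execute(reward);
        probe.state_root()
    }

    pub fn apply(&mut self, reward: u64, claimed_root: u64, restore_staking: bool) -> Result<(), String> {
        let snap = self.clone();
        self.execute(reward);
        let enforce = self.height >= STATE_ROOT_FORK_HEIGHT;
        if enforce && self.state_root() != claimed_root {
            let leaked = self.pending_rewards;
            *self = snap;
            if !restore_staking {
                // Pre-fix behaviour: staking state was not in the snapshot.
                self.pending_rewards = leaked;
            }
            return Err("#1e state_root mismatch".to_string());
        }
        Ok(())
    }
}

/// Returns `pending_rewards` (before, after) a forced #1e reject.
pub fn rejected_block_pending_rewards(restore_staking: bool) -> (u64, u64) {
    let mut node = Node::default();
    // Pad with valid blocks until the root check is enforced.
    while node.height < STATE_ROOT_FORK_HEIGHT {
        let root = node.expected_root(10);
        node.apply(10, root, restore_staking).expect("padding block must apply");
    }
    let before = node.pending_rewards;
    let tampered = node.expected_root(10) ^ 1; // flip one bit to force #1e
    assert!(node.apply(10, tampered, restore_staking).is_err());
    (before, node.pending_rewards)
}
```

With `restore_staking` set to `false` the second value exceeds the first (the leak); with `true` the two are equal, which is what the regression asserts.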

* fix(consensus): apply finalized peer block from local stash

On FinalizeBlock for a peer-proposed block the node already held the block
(it had voted for it; hash matches the action's block_hash, justification
carries the 2/3 precommit certificate) but refused to apply it unless it
was the local proposer — instead triggering libp2p sync and breaking. When
NewBlock gossip missed, the node sat in Finalize re-requesting a block it
already had, stalling/crawling the chain. Apply it locally like the
self-propose arm; validate_block still re-checks structure + justification
supermajority before the write. No state_root/format change.
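
A rough sketch of the apply-from-stash decision, with loud assumptions: `Block`, `FinalizeOutcome`, `on_finalize`, and the stash keyed by a `u64` hash are hypothetical, and the supermajority rule is written as strictly more than two thirds, which is the usual BFT convention but is not spelled out in this commit message.

```rust
use std::collections::HashMap;

#[derive(Clone, Debug, PartialEq)]
pub struct Block {
    pub hash: u64,
    pub height: u64,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum FinalizeOutcome {
    AppliedFromStash,
    RequestedSync,
}

/// Assumed rule: strictly more than 2/3 of validators precommitted.
pub fn has_supermajority(precommits: usize, validators: usize) -> bool {
    precommits * 3 > validators * 2
}

/// Handles a finalize action for `block_hash`, whoever proposed the block.
pub fn on_finalize(
    stash: &HashMap<u64, Block>,
    block_hash: u64,
    precommits: usize,
    validators: usize,
    chain: &mut Vec<Block>,
) -> FinalizeOutcome {
    match stash.get(&block_hash) {
        // The block is already held locally and its justification checks out.
        Some(block) if has_supermajority(precommits, validators) => {
            chain.push(block.clone());
            FinalizeOutcome::AppliedFromStash
        }
        // Missing block or weak justification: fall back to network sync.
        _ => FinalizeOutcome::RequestedSync,
    }
}
```

The point of the sketch is the first match arm: holding the block and a valid certificate is enough to apply it, regardless of who proposed it.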

* fix(consensus): quiet expected #1e on finalize apply-from-stash

The apply-from-stash path commits a peer-proposed finalized block via the
SelfProduced source; the stashed proposal carries the proposer's pre-apply
state_root, which never matches the freshly computed post-apply root, so it
trips the #1e CHECK every block (it still self-heals via the libp2p receive
path, which commits the canonical block). That flooded the logs ~1/block and
falsely drove the DivergenceTracker alarm. Keep the LOUD alarm + tracker only
for a Peer-source mismatch (a real cross-node divergence); log the self-apply
case at debug and skip the tracker. Behavior unchanged (still returns Err →
receive-path commit); only logging/metrics change.
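
The logging split can be pictured roughly as follows. This is a sketch under assumptions: `BlockSource`, `LogLevel`, `report_mismatch`, and the single counter inside `DivergenceTracker` are illustrative shapes, not the project's real types.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BlockSource {
    SelfProduced,
    Peer,
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum LogLevel {
    Debug,
    Error,
}

/// Hypothetical tracker that only counts mismatches.
#[derive(Default)]
pub struct DivergenceTracker {
    pub mismatches: u64,
}

/// Decides how loudly to report a #1e mismatch. The caller still returns Err
/// in both cases, so block handling itself is unchanged.
pub fn report_mismatch(source: BlockSource, tracker: &mut DivergenceTracker) -> LogLevel {
    match source {
        // A peer block that disagrees is a real cross-node divergence.
        BlockSource::Peer => {
            tracker.mismatches += 1;
            LogLevel::Error
        }
        // Apply-from-stash carries a pre-apply root, so a mismatch is expected.
        BlockSource::SelfProduced => LogLevel::Debug,
    }
}
```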

* fix(storage): poll-driven batched block persistence to close sync gaps

The save-writer only persisted block:{N} keys when a finalized height was
pushed onto save_tx, but that push lives in the commit path that is skipped
whenever add_block returns Err on the BFT apply-from-stash state_root
recompute mismatch. The block is still canonical (2/3 precommit
justification) and the chain advances, so most heights never got a block:{N}
key — they aged out of the in-memory window into permanent storage gaps,
stranding observer/fullnode GetBlocks sync on the missing height.

Make the writer poll-driven: every 5s persist the newly-committed block
range by chain membership (not by the apply result), via a new batched
save_blocks — one MDBX write txn / one fsync for the whole range. A first
attempt used per-block save_block, whose per-block full-env mdbx.sync()
contended with the apply path's trie write txns and stalled consensus
(3/s -> 0.1/s); batching collapses that to one fsync per tick. The full
state blob (save_blockchain) now runs on a 60s cadence purely to bound
load-time B2 replay, which already rebuilds accounts from the block:{N}
keys; the graceful-shutdown path still writes the blob on clean exit.

Verified on testnet: block rate holds ~2.9/s (no stall) and post-fix
blocks stay servable below the in-memory window (0 gaps).
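
A toy version of the poll-driven writer, to make the batching argument concrete. The `Store` and `Writer` types, the in-memory maps, and the `fsyncs` counter are assumptions standing in for MDBX and the real writer; only the name `save_blocks` and the idea of one sync per tick come from the text above.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for the persistent store.
#[derive(Default)]
pub struct Store {
    pub blocks: BTreeMap<u64, Vec<u8>>,
    pub fsyncs: u64,
}

impl Store {
    /// Writes the whole range as one transaction and syncs once.
    pub fn save_blocks(&mut self, batch: &[(u64, Vec<u8>)]) {
        if batch.is_empty() {
            return;
        }
        for (height, bytes) in batch {
            self.blocks.insert(*height, bytes.clone());
        }
        self.fsyncs += 1; // one sync per batch, not one per block
    }
}

/// Tracks the highest height already persisted (height 0 assumed persisted).
pub struct Writer {
    pub last_persisted: u64,
}

impl Writer {
    /// One poll tick: persist every block the chain holds above the
    /// last persisted height, regardless of how each block was applied.
    pub fn tick(&mut self, chain: &BTreeMap<u64, Vec<u8>>, store: &mut Store) -> usize {
        let batch: Vec<(u64, Vec<u8>)> = chain
            .range(self.last_persisted + 1..)
            .map(|(height, bytes)| (*height, bytes.clone()))
            .collect();
        store.save_blocks(&batch);
        if let Some((height, _)) = batch.last() {
            self.last_persisted = *height;
        }
        batch.len()
    }
}
```

Selecting by chain membership is what closes the gap: a block that advanced the chain gets a key even if its apply call returned an error.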

* fix(storage): stop B3b overwriting total_minted with closed-form on load

load_blockchain's B3b reconcile overwrote total_minted with a closed-form
sum (TOTAL_PREMINE + Σ flat BLOCK_REWARD>>halvings per height). That assumes
every block's coinbase.amount equals the flat reward, but blocks with a
reduced/zero coinbase make the true minted — the sum of the proposer-stamped
coinbase.amount, which block_executor.rs:795 accumulates live and which all
running validators agree on — strictly LESS than the closed form (~3000 SRX
at testnet h≈6.27M). total_minted feeds state_root, so the overwrite forced a
divergent total_minted into the root on every load: an observer GetBlocks-#1e
rejected every block, and a validator restart would have forked. The blob
holds the canonical live value and B2 replay re-applies coinbase for the
post-checkpoint range, so the value is already correct after load — the
closed-form overwrite was redundant and wrong. Keep the comparison as an
advisory warn; do not mutate. Verified: a node reloaded with this change
reconciles total_minted to the live/consensus value instead of the inflated
closed form.
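
The gap between the closed form and the live sum is easy to see with small numbers. In this sketch the constants are hypothetical (the real premine, reward, and halving interval are not stated here), and `reconcile_on_load` is an illustrative name for the advisory comparison.

```rust
// Hypothetical constants chosen to keep the arithmetic readable.
pub const TOTAL_PREMINE: u64 = 1_000;
pub const BLOCK_REWARD: u64 = 8;
pub const HALVING_INTERVAL: u64 = 4;

/// Flat reward at a height, halved once per interval.
pub fn reward_at(height: u64) -> u64 {
    let halvings = height / HALVING_INTERVAL;
    if halvings >= 64 {
        0
    } else {
        BLOCK_REWARD >> halvings
    }
}

/// Closed form: assumes every block minted the full flat reward.
pub fn closed_form_minted(height: u64) -> u64 {
    TOTAL_PREMINE + (1..=height).map(reward_at).sum::<u64>()
}

/// Live value: premine plus the coinbase amounts actually stamped.
pub fn actual_minted(coinbase_amounts: &[u64]) -> u64 {
    TOTAL_PREMINE + coinbase_amounts.iter().sum::<u64>()
}

/// Advisory reconcile: keeps the persisted value and only reports a difference.
pub fn reconcile_on_load(persisted: u64, height: u64) -> (u64, Option<String>) {
    let closed = closed_form_minted(height);
    let warning = if closed != persisted {
        Some(format!("total_minted {persisted} differs from closed form {closed}"))
    } else {
        None
    };
    (persisted, warning)
}
```

With coinbase amounts `[8, 8, 0, 4]` over four blocks, the closed form gives 1028 while the live sum is 1020, so overwriting on load would inflate the value by the one skipped reward.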

* feat(consensus): add one-time treasury rebase reconciliation (fork-gated)

PROTOCOL_TREASURY drifted across validators (val2/val3 over-credited ~1-2 SRX)
during the multipath-distribute_reward era (STATE_ROOT_V2_HEIGHT 2689134 →
REWARD_APPLY_PATH_HEIGHT 6239300). Credit is single-path/deterministic since
6239300 so the drift is frozen, but because treasury is committed in the
state_root trie (since 2689134) each node computes a divergent local state_root
— observers #1e-reject every block (validators tolerate via apply-from-stash).
Treasury is the SOLE divergent account (all others byte-identical fleet-wide).

Add a SEPARATE one-time force-set in update_trie_for_block gated on new env
TREASURY_REBASE_HEIGHT + TREASURY_REBASE_BALANCE (default u64::MAX = dormant,
ships safe). It must NOT reuse STATE_ROOT_V2_HEIGHT: that var also drives the
trie-INCLUSION cutoff (block.index >= STATE_ROOT_V2_HEIGHT), so moving it would
retroactively drop treasury from the trie for the historical range and fork.
At the activation height every node force-sets the same operator-chosen canonical
balance, so the fleet converges; the set is deterministic, so B2 replay re-applies it.
Activation runbook: pick the canonical balance (supply-consistent majority or
history-recompute), halt all nodes, set the two env vars, start all nodes
simultaneously, and verify treasury agreement at the activation block.
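
A hedged sketch of how such a dormant-by-default gate could be expressed. The `TreasuryRebase` struct, its methods, and the rule that both variables must parse for the gate to arm are assumptions; the env var names and the `u64::MAX` default are taken from the message above.

```rust
/// Fork gate for the one-time treasury rebase. `u64::MAX` means dormant.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct TreasuryRebase {
    pub height: u64,
    pub balance: u64,
}

impl TreasuryRebase {
    /// Parses raw values; anything missing or malformed leaves the gate dormant.
    pub fn parse(height: Option<&str>, balance: Option<&str>) -> Self {
        let height = height.and_then(|s| s.trim().parse::<u64>().ok());
        let balance = balance.and_then(|s| s.trim().parse::<u64>().ok());
        match (height, balance) {
            (Some(height), Some(balance)) => Self { height, balance },
            _ => Self { height: u64::MAX, balance: u64::MAX },
        }
    }

    /// Reads the two env vars named in the commit message.
    pub fn from_env() -> Self {
        let height = std::env::var("TREASURY_REBASE_HEIGHT").ok();
        let balance = std::env::var("TREASURY_REBASE_BALANCE").ok();
        Self::parse(height.as_deref(), balance.as_deref())
    }

    /// Force-sets the treasury only at the activation block.
    pub fn apply(&self, block_index: u64, treasury: &mut u64) -> bool {
        if self.height != u64::MAX && block_index == self.height {
            *treasury = self.balance;
            return true;
        }
        false
    }
}
```

Two nodes whose treasuries differ before the activation height end up with the same balance after it, and because the set depends only on the block index it replays identically.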

* feat(node): observer-tolerant state_root accept for fullnodes (gated)

An observer/fullnode applies every block via add_block_from_peer (Peer);
the Peer branch of the #1e state_root check rejected any block whose
proposer-stamped root differed from the local recompute, which on a chain
whose state-commitment is imperfect (recurring/oscillating state_root that
validators already tolerate via apply-from-stash, since consensus is on
block_hash not state_root) means the observer rejects EVERY block and can
never sync — even from a clean cp with byte-identical blocks and a canonical
trie root_at_version.

Gate a tolerant path behind SENTRIX_OBSERVER_TOLERANT_STATE_ROOT=1 (default
OFF, validators unchanged): for a Peer block — which has already passed the
strict 2/3 precommit justification verification earlier in add_block_impl,
so it IS the network-agreed canonical block — accept on #1e by stamping the
received (proposer's, canonical) root and returning Ok instead of Err,
logging the local divergence at debug. The observer's local accounts already
diverge from that root (the same pre-existing commitment imperfection every
node has), so served state is no worse than a validator's. Rejecting a
2/3-justified block is halting on canonical data; accepting+stamping keeps
the observer's chain consistent with the committed roots so it can sync and
serve RPC. Validated on the testnet fullnode: ok=261 err=0 (was ok=0
err≈20), CRITICAL #1e=0, catching up at ~10 blk/s. Does not weaken the
justification verify; validators keep strict #1e (env off).
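
The gated decision might look roughly like this. It is a sketch with assumed names (`BlockSource`, `observer_tolerant`, `check_state_root`) and it presumes, as the message states, that the justification was already verified before this point; only the env var name and the accept-and-stamp behaviour come from the text.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BlockSource {
    SelfProduced,
    Peer,
}

/// True only when the env value is exactly "1" (default is off).
pub fn observer_tolerant(env_value: Option<&str>) -> bool {
    env_value == Some("1")
}

/// Returns the root to stamp on the block, or Err for a #1e reject.
/// Assumes the caller has already verified the precommit justification.
pub fn check_state_root(
    source: BlockSource,
    local_root: u64,
    received_root: u64,
    tolerant: bool,
) -> Result<u64, String> {
    if local_root == received_root {
        return Ok(local_root);
    }
    if tolerant && source == BlockSource::Peer {
        // Accept the network-agreed block and stamp the proposer's root.
        return Ok(received_root);
    }
    Err(format!(
        "#1e state_root mismatch: local {local_root:x}, received {received_root:x}"
    ))
}
```

A node would read the flag once, for example with `observer_tolerant(std::env::var("SENTRIX_OBSERVER_TOLERANT_STATE_ROOT").ok().as_deref())`, so validators that leave the variable unset keep the strict path.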

* test(storage): update B3b test for advisory no-overwrite behavior

fix-B made B3b advisory (no longer overwrites total_minted with the closed-form, which over-counts on reduced-coinbase chains and feeds a divergent state_root). The old test asserted the overwrite; flip it to assert the persisted blob value survives load untouched. Sole failing test in CI (272 passed, 1 failed); all 12 storage tests green locally.