fix: testnet stabilization — liveness, persistence, determinism reconciliation (post-#792)#795
Conversation
Pads past STATE_ROOT_FORK_HEIGHT, runs the reward bundle, then forces a #1e via a tampered state_root and asserts pending_rewards rolls back. Fails before the snapshot fix (leak), passes after.
On FinalizeBlock for a peer-proposed block the node already held the block (it had voted for it; hash matches the action's block_hash, justification carries the 2/3 precommit certificate) but refused to apply it unless it was the local proposer — instead triggering libp2p sync and breaking. When NewBlock gossip missed, the node sat in Finalize re-requesting a block it already had, stalling/crawling the chain. Apply it locally like the self-propose arm; validate_block still re-checks structure + justification supermajority before the write. No state_root/format change.
The apply-from-stash path commits a peer-proposed finalized block via the SelfProduced source; the stashed proposal carries the proposer's pre-apply state_root, which never matches the freshly computed post-apply root, so it trips the #1e CHECK every block (it still self-heals via the libp2p receive path, which commits the canonical block). That flooded the logs ~1/block and falsely drove the DivergenceTracker alarm. Keep the LOUD alarm + tracker only for a Peer-source mismatch (a real cross-node divergence); log the self-apply case at debug and skip the tracker. Behavior unchanged (still returns Err → receive-path commit); only logging/metrics change.
The save-writer only persisted block:{N} keys when a finalized height was
pushed onto save_tx, but that push lives in the commit path that is skipped
whenever add_block returns Err on the BFT apply-from-stash state_root
recompute mismatch. The block is still canonical (2/3 precommit
justification) and the chain advances, so most heights never got a block:{N}
key — they aged out of the in-memory window into permanent storage gaps,
stranding observer/fullnode GetBlocks sync on the missing height.
Make the writer poll-driven: every 5s persist the newly-committed block
range by chain membership (not by the apply result), via a new batched
save_blocks — one MDBX write txn / one fsync for the whole range. A first
attempt used per-block save_block, whose per-block full-env mdbx.sync()
contended with the apply path's trie write txns and stalled consensus
(3/s -> 0.1/s); batching collapses that to one fsync per tick. The full
state blob (save_blockchain) now runs on a 60s cadence purely to bound
load-time B2 replay, which already rebuilds accounts from the block:{N}
keys; the graceful-shutdown path still writes the blob on clean exit.
Verified on testnet: block rate holds ~2.9/s (no stall) and post-fix
blocks stay servable below the in-memory window (0 gaps).
load_blockchain's B3b reconcile overwrote total_minted with a closed-form sum (TOTAL_PREMINE + Σ flat BLOCK_REWARD>>halvings per height). That assumes every block's coinbase.amount equals the flat reward, but blocks with a reduced/zero coinbase make the true minted — the sum of the proposer-stamped coinbase.amount, which block_executor.rs:795 accumulates live and which all running validators agree on — strictly LESS than the closed form (~3000 SRX at testnet h≈6.27M). total_minted feeds state_root, so the overwrite forced a divergent total_minted into the root on every load: an observer GetBlocks-#1e rejected every block, and a validator restart would have forked. The blob holds the canonical live value and B2 replay re-applies coinbase for the post-checkpoint range, so the value is already correct after load — the closed-form overwrite was redundant and wrong. Keep the comparison as an advisory warn; do not mutate. Verified: a node reloaded with this change reconciles total_minted to the live/consensus value instead of the inflated closed form.
|
Warning Review limit reached
More reviews will be available in 38 minutes and 36 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (6)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…ted) PROTOCOL_TREASURY drifted across validators (val2/val3 over-credited ~1-2 SRX) during the multipath-distribute_reward era (STATE_ROOT_V2_HEIGHT 2689134 → REWARD_APPLY_PATH_HEIGHT 6239300). Credit is single-path/deterministic since 6239300 so the drift is frozen, but because treasury is committed in the state_root trie (since 2689134) each node computes a divergent local state_root — observers #1e-reject every block (validators tolerate via apply-from-stash). Treasury is the SOLE divergent account (all others byte-identical fleet-wide). Add a SEPARATE one-time force-set in update_trie_for_block gated on new env TREASURY_REBASE_HEIGHT + TREASURY_REBASE_BALANCE (default u64::MAX = dormant, ships safe). It must NOT reuse STATE_ROOT_V2_HEIGHT: that var also drives the trie-INCLUSION cutoff (block.index >= STATE_ROOT_V2_HEIGHT), so moving it would retroactively drop treasury from the trie for the historical range and fork. At the activation height every node sets the same operator-set canonical → converge; deterministic so B2 replay re-applies it. Activation runbook: pick canonical (supply-consistent majority or history-recompute), halt-all, set the two env vars, simul-start, verify treasury agreement at the activation block.
An observer/fullnode applies every block via add_block_from_peer (Peer); the Peer branch of the #1e state_root check rejected any block whose proposer-stamped root differed from the local recompute, which on a chain whose state-commitment is imperfect (recurring/oscillating state_root that validators already tolerate via apply-from-stash, since consensus is on block_hash not state_root) means the observer rejects EVERY block and can never sync — even from a clean cp with byte-identical blocks and a canonical trie root_at_version. Gate a tolerant path behind SENTRIX_OBSERVER_TOLERANT_STATE_ROOT=1 (default OFF, validators unchanged): for a Peer block — which has already passed the strict 2/3 precommit justification verification earlier in add_block_impl, so it IS the network-agreed canonical block — accept on #1e by stamping the received (proposer's, canonical) root and returning Ok instead of Err, logging the local divergence at debug. The observer's local accounts already diverge from that root (the same pre-existing commitment imperfection every node has), so served state is no worse than a validator's. Rejecting a 2/3-justified block is halting on canonical data; accepting+stamping keeps the observer's chain consistent with the committed roots so it can sync and serve RPC. Validated on the testnet fullnode: ok=261 err=0 (was ok=0 err≈20), CRITICAL #1e=0, catching up at ~10 blk/s. Does not weaken the justification verify; validators keep strict #1e (env off).
277e56c to
458e815
Compare
fix-B made B3b advisory (no longer overwrites total_minted with the closed-form, which over-counts on reduced-coinbase chains and feeds a divergent state_root). The old test asserted the overwrite; flip it to assert the persisted blob value survives load untouched. Sole failing test in CI (272 passed, 1 failed); all 12 storage tests green locally.
#801) * fix(consensus): accept justified peer block on #1e by default (no env) Replaces the per-node SENTRIX_OBSERVER_TOLERANT_STATE_ROOT env gate (#795) with a uniform, default behavior for ALL nodes: on a #1e state_root mismatch for a Peer block, ACCEPT it (stamp the proposer's root) iff the block carries a justification — which means it already passed the 2/3 justification gate in add_block_impl before pass-2 (an absent/insufficient justification is rejected there). A no-justification Peer block still hits the strict #1e reject. Why: consensus is on block_hash (2/3 justification), not state_root. The chain's state-commitment is imperfect (recurring state_root), so a node that falls behind and catches up via sync recomputes a different root and the strict #1e REJECTED the canonical block → the validator got stuck syncing → chain stalled until halt-all (val1, 2026-06-06). The env gate made only the fullnode tolerant, so validators kept stalling; making it the default fixes the stuck-sync at the root and removes the val-vs-fullnode divergence. Safety: this only changes the accept/reject decision on already-canonical (justification-gated) blocks; the stamped root is identical either way, so no computed-state change and no fork gate needed. The accept inherits the chain's existing justification trust level (the add_block_impl gate) — no weaker. NOTE: cryptographic per-signature justification verification (STRICT_JUSTIFICATION_HEIGHT) is a separate, currently-broken path (recover_signer disagrees with the producer's signing payload — validated 2026-06-06: enabling it rejects legitimate blocks); fixing that to enable strict verification fleet-wide is a tracked follow-up. Tests: justified Peer block + tampered root past fork height → ACCEPTED, root stamped; the no-justification counterpart → REJECTED (the updated C-03 rollback test). Both use the pad-past-STATE_ROOT_FORK_HEIGHT harness. sentrix-core 264 passed; clippy --all-targets -D warnings clean. * fix(consensus): plumb Pass-1 justification-verified flag into #1e accept (CodeRabbit) CodeRabbit #801 Major: keying the #1e accept off block.justification.is_some() was imprecise — add_block_impl SKIPS the 2/3 gate when SENTRIX_REPLAY_BYPASS_AUTHZ is set or voyager_mode_for(height) is false, so a peer block could carry an arbitrary (unvalidated) justification blob, reach Pass-2, and be accepted on #1e without ever passing BFT validation. Fix: a per-add transient `current_add_justification_verified` (serde-skipped), reset at the top of add_block_impl and set true ONLY at the end of the gate block — i.e. only when the gate actually RAN and PASSED. Pass-2's #1e accept now reads that flag instead of justification presence, so it tolerates a state_root mismatch strictly on gate-verified blocks; replay / pre-voyager / no-justification blocks fall through to the strict reject. Tests unchanged + still green (justified→accept goes through the gate→flag set; unjustified→reject, gate skipped→flag false). sentrix-core suite + clippy --all-targets -D warnings clean.
…802) test_c03_pass2_failure_rolls_back_state credits v1 to u64::MAX - reward and relies on the coinbase credit overflowing in Pass 2. It read get_block_reward() (which depends on the reward-fork env vars: VOYAGER_REWARD_V2_HEIGHT / TOKENOMICS_V2_HEIGHT / halving) WITHOUT holding env_test_lock, while sibling tests set those vars under that lock. Under the parallel `cargo test` run (e.g. the report-only coverage job) a concurrent mutation changed `reward` mid-test, the overflow didn't trigger, add_block returned Ok, and unwrap_err panicked — the recurring CI flake that needed re-runs on #795/#798/#800/#801. Acquire env_test_lock at the top of the test so it serializes against the env-mutating tests. Test-only change. sentrix-core suite + clippy --all-targets -D warnings clean.
Clean rebase of the 2026-06-06 testnet stabilization work onto current main (the prior branch conflicted — it carried #792's already-merged commit + a revert-pair). 7 net commits, cherry-picked clean,
cargo check --workspace -D warningsgreen. All validated live on testnet (validators #1e=0, identical state_roots, 5 nodes synced, fullnode synced).Changes
save_blocks+ unit test.Safety
Supersedes #794 (conflicting).