Skip to content

fix: testnet stabilization — liveness, persistence, determinism reconciliation (post-#792)#795

Merged
github-actions[bot] merged 8 commits into
mainfrom
merge/testnet-stabilization
Jun 6, 2026
Merged

fix: testnet stabilization — liveness, persistence, determinism reconciliation (post-#792)#795
github-actions[bot] merged 8 commits into
mainfrom
merge/testnet-stabilization

Conversation

@satyakwok
Copy link
Copy Markdown
Member

Clean rebase of the 2026-06-06 testnet stabilization work onto current main (the prior branch conflicted — it carried #792's already-merged commit + a revert-pair). 7 net commits, cherry-picked clean, cargo check --workspace -D warnings green. All validated live on testnet (validators #1e=0, identical state_roots, 5 nodes synced, fullnode synced).

Changes

  • Liveness: apply finalized peer block from local stash (unsticks finalize crawl) + quiet expected self-apply #1e.
  • Persistence: poll-driven batched block persister (1 WriteBatch/1 fsync per tick) — closes the block:{N} gaps that stranded observer sync; save_blocks + unit test.
  • total_minted: stop B3b overwriting with the closed-form on load (closed form over-counts vs live stamped-coinbase sum → divergent root on reload / latent fork-on-restart). Trust blob + B2 replay.
  • TREASURY_REBASE (fork-gated, default OFF): separate one-time PROTOCOL_TREASURY reconciliation; does NOT reuse STATE_ROOT_V2_HEIGHT (would move trie inclusion + fork).
  • Observer-tolerant (gated, default OFF, SENTRIX_OBSERVER_TOLERANT_STATE_ROOT): a fullnode that verified a block's 2/3 justification accepts + stamps the proposer's canonical root on local #1e instead of rejecting, so it can sync a chain with imperfect state-commitment. Validators keep strict #1e.

Safety

  • TREASURY_REBASE + observer-tolerant are env-gated, default off → no behavior change for validators/mainnet without explicit chain-specific activation.
  • Liveness + #1e-quiet baked on testnet this session.

Supersedes #794 (conflicting).

satyakwok added 5 commits June 6, 2026 07:55
Pads past STATE_ROOT_FORK_HEIGHT, runs the reward bundle, then forces a
#1e via a tampered state_root and asserts pending_rewards rolls back.
Fails before the snapshot fix (leak), passes after.
On FinalizeBlock for a peer-proposed block the node already held the block
(it had voted for it; hash matches the action's block_hash, justification
carries the 2/3 precommit certificate) but refused to apply it unless it
was the local proposer — instead triggering libp2p sync and breaking. When
NewBlock gossip missed, the node sat in Finalize re-requesting a block it
already had, stalling/crawling the chain. Apply it locally like the
self-propose arm; validate_block still re-checks structure + justification
supermajority before the write. No state_root/format change.
The apply-from-stash path commits a peer-proposed finalized block via the
SelfProduced source; the stashed proposal carries the proposer's pre-apply
state_root, which never matches the freshly computed post-apply root, so it
trips the #1e CHECK every block (it still self-heals via the libp2p receive
path, which commits the canonical block). That flooded the logs ~1/block and
falsely drove the DivergenceTracker alarm. Keep the LOUD alarm + tracker only
for a Peer-source mismatch (a real cross-node divergence); log the self-apply
case at debug and skip the tracker. Behavior unchanged (still returns Err →
receive-path commit); only logging/metrics change.
The save-writer only persisted block:{N} keys when a finalized height was
pushed onto save_tx, but that push lives in the commit path that is skipped
whenever add_block returns Err on the BFT apply-from-stash state_root
recompute mismatch. The block is still canonical (2/3 precommit
justification) and the chain advances, so most heights never got a block:{N}
key — they aged out of the in-memory window into permanent storage gaps,
stranding observer/fullnode GetBlocks sync on the missing height.

Make the writer poll-driven: every 5s persist the newly-committed block
range by chain membership (not by the apply result), via a new batched
save_blocks — one MDBX write txn / one fsync for the whole range. A first
attempt used per-block save_block, whose per-block full-env mdbx.sync()
contended with the apply path's trie write txns and stalled consensus
(3/s -> 0.1/s); batching collapses that to one fsync per tick. The full
state blob (save_blockchain) now runs on a 60s cadence purely to bound
load-time B2 replay, which already rebuilds accounts from the block:{N}
keys; the graceful-shutdown path still writes the blob on clean exit.

Verified on testnet: block rate holds ~2.9/s (no stall) and post-fix
blocks stay servable below the in-memory window (0 gaps).
load_blockchain's B3b reconcile overwrote total_minted with a closed-form
sum (TOTAL_PREMINE + Σ flat BLOCK_REWARD>>halvings per height). That assumes
every block's coinbase.amount equals the flat reward, but blocks with a
reduced/zero coinbase make the true minted — the sum of the proposer-stamped
coinbase.amount, which block_executor.rs:795 accumulates live and which all
running validators agree on — strictly LESS than the closed form (~3000 SRX
at testnet h≈6.27M). total_minted feeds state_root, so the overwrite forced a
divergent total_minted into the root on every load: an observer GetBlocks-#1e
rejected every block, and a validator restart would have forked. The blob
holds the canonical live value and B2 replay re-applies coinbase for the
post-checkpoint range, so the value is already correct after load — the
closed-form overwrite was redundant and wrong. Keep the comparison as an
advisory warn; do not mutate. Verified: a node reloaded with this change
reconciles total_minted to the live/consensus value instead of the inflated
closed form.
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 6, 2026

Warning

Review limit reached

@satyakwok, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 38 minutes and 36 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 4d4fda09-fcf7-44d7-a5da-521b2693cbc9

📥 Commits

Reviewing files that changed from the base of the PR and between 2e08be8 and abada53.

📒 Files selected for processing (6)
  • Cargo.toml
  • bin/sentrix/src/main.rs
  • crates/sentrix-core/src/block_executor.rs
  • crates/sentrix-core/src/blockchain_trie_ops.rs
  • crates/sentrix-core/src/storage.rs
  • crates/sentrix-storage/src/chain.rs
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch merge/testnet-stabilization

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions github-actions Bot enabled auto-merge (squash) June 6, 2026 05:57
@satyakwok satyakwok self-assigned this Jun 6, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 6, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

satyakwok added 2 commits June 6, 2026 08:00
…ted)

PROTOCOL_TREASURY drifted across validators (val2/val3 over-credited ~1-2 SRX)
during the multipath-distribute_reward era (STATE_ROOT_V2_HEIGHT 2689134 →
REWARD_APPLY_PATH_HEIGHT 6239300). Credit is single-path/deterministic since
6239300 so the drift is frozen, but because treasury is committed in the
state_root trie (since 2689134) each node computes a divergent local state_root
— observers #1e-reject every block (validators tolerate via apply-from-stash).
Treasury is the SOLE divergent account (all others byte-identical fleet-wide).

Add a SEPARATE one-time force-set in update_trie_for_block gated on new env
TREASURY_REBASE_HEIGHT + TREASURY_REBASE_BALANCE (default u64::MAX = dormant,
ships safe). It must NOT reuse STATE_ROOT_V2_HEIGHT: that var also drives the
trie-INCLUSION cutoff (block.index >= STATE_ROOT_V2_HEIGHT), so moving it would
retroactively drop treasury from the trie for the historical range and fork.
At the activation height every node sets the same operator-set canonical →
converge; deterministic so B2 replay re-applies it. Activation runbook: pick
canonical (supply-consistent majority or history-recompute), halt-all, set the
two env vars, simul-start, verify treasury agreement at the activation block.
An observer/fullnode applies every block via add_block_from_peer (Peer);
the Peer branch of the #1e state_root check rejected any block whose
proposer-stamped root differed from the local recompute, which on a chain
whose state-commitment is imperfect (recurring/oscillating state_root that
validators already tolerate via apply-from-stash, since consensus is on
block_hash not state_root) means the observer rejects EVERY block and can
never sync — even from a clean cp with byte-identical blocks and a canonical
trie root_at_version.

Gate a tolerant path behind SENTRIX_OBSERVER_TOLERANT_STATE_ROOT=1 (default
OFF, validators unchanged): for a Peer block — which has already passed the
strict 2/3 precommit justification verification earlier in add_block_impl,
so it IS the network-agreed canonical block — accept on #1e by stamping the
received (proposer's, canonical) root and returning Ok instead of Err,
logging the local divergence at debug. The observer's local accounts already
diverge from that root (the same pre-existing commitment imperfection every
node has), so served state is no worse than a validator's. Rejecting a
2/3-justified block is halting on canonical data; accepting+stamping keeps
the observer's chain consistent with the committed roots so it can sync and
serve RPC. Validated on the testnet fullnode: ok=261 err=0 (was ok=0
err≈20), CRITICAL #1e=0, catching up at ~10 blk/s. Does not weaken the
justification verify; validators keep strict #1e (env off).
@satyakwok satyakwok force-pushed the merge/testnet-stabilization branch from 277e56c to 458e815 Compare June 6, 2026 06:01
fix-B made B3b advisory (no longer overwrites total_minted with the closed-form, which over-counts on reduced-coinbase chains and feeds a divergent state_root). The old test asserted the overwrite; flip it to assert the persisted blob value survives load untouched. Sole failing test in CI (272 passed, 1 failed); all 12 storage tests green locally.
@github-actions github-actions Bot merged commit 07d8def into main Jun 6, 2026
19 checks passed
@satyakwok satyakwok deleted the merge/testnet-stabilization branch June 6, 2026 07:22
github-actions Bot pushed a commit that referenced this pull request Jun 6, 2026
#801)

* fix(consensus): accept justified peer block on #1e by default (no env)

Replaces the per-node SENTRIX_OBSERVER_TOLERANT_STATE_ROOT env gate (#795)
with a uniform, default behavior for ALL nodes: on a #1e state_root mismatch
for a Peer block, ACCEPT it (stamp the proposer's root) iff the block carries
a justification — which means it already passed the 2/3 justification gate in
add_block_impl before pass-2 (an absent/insufficient justification is rejected
there). A no-justification Peer block still hits the strict #1e reject.

Why: consensus is on block_hash (2/3 justification), not state_root. The
chain's state-commitment is imperfect (recurring state_root), so a node that
falls behind and catches up via sync recomputes a different root and the
strict #1e REJECTED the canonical block → the validator got stuck syncing →
chain stalled until halt-all (val1, 2026-06-06). The env gate made only the
fullnode tolerant, so validators kept stalling; making it the default fixes
the stuck-sync at the root and removes the val-vs-fullnode divergence.

Safety: this only changes the accept/reject decision on already-canonical
(justification-gated) blocks; the stamped root is identical either way, so no
computed-state change and no fork gate needed. The accept inherits the chain's
existing justification trust level (the add_block_impl gate) — no weaker.
NOTE: cryptographic per-signature justification verification
(STRICT_JUSTIFICATION_HEIGHT) is a separate, currently-broken path
(recover_signer disagrees with the producer's signing payload — validated
2026-06-06: enabling it rejects legitimate blocks); fixing that to enable
strict verification fleet-wide is a tracked follow-up.

Tests: justified Peer block + tampered root past fork height → ACCEPTED, root
stamped; the no-justification counterpart → REJECTED (the updated C-03
rollback test). Both use the pad-past-STATE_ROOT_FORK_HEIGHT harness.
sentrix-core 264 passed; clippy --all-targets -D warnings clean.

* fix(consensus): plumb Pass-1 justification-verified flag into #1e accept (CodeRabbit)

CodeRabbit #801 Major: keying the #1e accept off block.justification.is_some()
was imprecise — add_block_impl SKIPS the 2/3 gate when SENTRIX_REPLAY_BYPASS_AUTHZ
is set or voyager_mode_for(height) is false, so a peer block could carry an
arbitrary (unvalidated) justification blob, reach Pass-2, and be accepted on
#1e without ever passing BFT validation.

Fix: a per-add transient `current_add_justification_verified` (serde-skipped),
reset at the top of add_block_impl and set true ONLY at the end of the gate
block — i.e. only when the gate actually RAN and PASSED. Pass-2's #1e accept
now reads that flag instead of justification presence, so it tolerates a
state_root mismatch strictly on gate-verified blocks; replay / pre-voyager /
no-justification blocks fall through to the strict reject.

Tests unchanged + still green (justified→accept goes through the gate→flag set;
unjustified→reject, gate skipped→flag false). sentrix-core suite + clippy
--all-targets -D warnings clean.
github-actions Bot pushed a commit that referenced this pull request Jun 6, 2026
…802)

test_c03_pass2_failure_rolls_back_state credits v1 to u64::MAX - reward and
relies on the coinbase credit overflowing in Pass 2. It read get_block_reward()
(which depends on the reward-fork env vars: VOYAGER_REWARD_V2_HEIGHT /
TOKENOMICS_V2_HEIGHT / halving) WITHOUT holding env_test_lock, while sibling
tests set those vars under that lock. Under the parallel `cargo test` run (e.g.
the report-only coverage job) a concurrent mutation changed `reward` mid-test,
the overflow didn't trigger, add_block returned Ok, and unwrap_err panicked —
the recurring CI flake that needed re-runs on #795/#798/#800/#801.

Acquire env_test_lock at the top of the test so it serializes against the
env-mutating tests. Test-only change. sentrix-core suite + clippy --all-targets
-D warnings clean.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant