fix(trie): race-free offline prune + background prune off by default#800
Conversation
The durable fix for the recurring trie "missing node" stalls. Root cause (confirmed): the BACKGROUND prune runs on a thread cloning the trie at a frozen version while block apply keeps committing; a node committed or content-addressed-resurfaced during the multi-minute live-set walk is absent from the frozen snapshot and gets deleted as an orphan. Five partial fixes (#711 reload-before-gc, #714 collect_reachable depth, #791 split passes, #798 generational defer) each narrowed but never closed the window — #798 deleted live nodes again in production (val3, h=6300000 prune → "missing node" → propose stall). A truly race-free background prune needs walk+delete in one MDBX RW txn (a chain-blocking write lock for the 10-20 min walk) or refcounting — both deferred. This takes the mechanism-agnostic safe route: eliminate the concurrency. - `SentrixTrie::prune_offline(keep)` — same walk + keep-window as `prune`, minus the racy augment and the generational deferral, using the combined immediate `gc_orphaned_nodes`. Correct only with no concurrent commits, which is guaranteed by running it on a STOPPED node. - `sentrix chain prune [--keep N]` — operator runs it during a maintenance halt (same model as `chain reset-trie` / `verify-deep`). Safe on a single peer: deleting unreachable nodes does not change the state_root (which only commits reachable nodes), so no fork risk — unlike reset-trie. - Background prune is now OFF by default. It only runs with SENTRIX_ENABLE_BACKGROUND_TRIE_PRUNE=1 (and the legacy SENTRIX_DISABLE_TRIE_PRUNE=1 still force-disables). maybe_prune_trie early-returns by default, so the apply path no longer even spawns the racy thread. Trade-off: storage grows between maintenance prunes. Acceptable; correctness over convenience after five automatic-prune failures. Tests: prune_offline reclaims orphans AND the current value survives the prune (the regression guard #798 lacked — a live-node deletion would make the post-prune `get` fail with "missing node"); background_prune_enabled gate is off by default + opt-in semantics. cargo test -p sentrix-trie: 84 passed; -p sentrix-core: 263 passed; cargo check --workspace -D warnings clean.
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
Worried about impact? Review this PR in Change Stack to explore blast radius before you approve or request changes. Warning Review limit reached
More reviews will be available in 40 minutes and 3 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: ASSERTIVE Plan: Pro Plus Run ID: 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughThis PR adds an offline trie garbage collection feature to prune unreachable nodes from long-running blockchain state, along with environment-variable gating to control background pruning behavior. A new Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@bin/sentrix/src/commands/chain.rs`:
- Around line 288-290: The handler currently only prints a warning and then
calls trie.prune_offline(keep); change it to fail closed by acquiring an
inter-process exclusivity/maintenance lock before calling trie.prune_offline:
attempt to open the MDBX environment or the chain DB path in a way that fails if
another writer is active (or create and lock a dedicated maintenance lock file
with an exclusive flock), check that lock acquisition succeeded, and return an
error (abort) if it did not; reference trie.prune_offline and the command
handler around where the two println! calls and trie.prune_offline(keep) are
invoked so the prune only runs when the exclusive lock is held.
- Around line 279-286: The pruning command currently calls bc.init_trie(...)
which can rebuild the trie from AccountDB; instead, refuse to proceed if there
is no already-persisted trie/root. Remove or skip the call to
bc.init_trie(Arc::clone(&mdbx)) in cmd_chain_prune (or the function handling
prune), and add an explicit precondition: if bc.height() > 0 and
bc.state_trie.as_ref().is_none() then return an error (e.g. anyhow!("persisted
trie/root missing; refuse to prune") ). Keep allowing operations when height ==
0 as before, but do not trigger any rebuild-on-open behavior here. Ensure the
check references bc.init_trie, bc.height, and bc.state_trie to locate the
change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro Plus
Run ID: 04bbd063-c2c7-44c0-a5be-5efe8172c4e5
⛔ Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock,!**/*.lock
📒 Files selected for processing (6)
Cargo.tomlbin/sentrix/src/commands/chain.rsbin/sentrix/src/main.rscrates/sentrix-core/src/blockchain.rscrates/sentrix-core/src/blockchain_trie_ops.rscrates/sentrix-trie/src/tree.rs
…odeRabbit) - Clippy: background_prune_enabled uses is_none_or instead of !..is_some_and (nonminimal_bool, -D warnings in CI; cargo check didn't catch it). - chain prune Major guards (CodeRabbit on #800): - Refuse if no persisted trie root at the current height — never let the maintenance prune trigger an init_trie backfill rebuild (different node shape = reset-trie fork class). New Storage::has_persisted_trie_root. - Enforce the offline precondition instead of only printing it: detect a running node via height-stability (sample across >5s poll-persist interval) and fail closed if the chain advanced; override with SENTRIX_ALLOW_ONLINE_PRUNE=1 for rare recovery.
…802) test_c03_pass2_failure_rolls_back_state credits v1 to u64::MAX - reward and relies on the coinbase credit overflowing in Pass 2. It read get_block_reward() (which depends on the reward-fork env vars: VOYAGER_REWARD_V2_HEIGHT / TOKENOMICS_V2_HEIGHT / halving) WITHOUT holding env_test_lock, while sibling tests set those vars under that lock. Under the parallel `cargo test` run (e.g. the report-only coverage job) a concurrent mutation changed `reward` mid-test, the overflow didn't trigger, add_block returned Ok, and unwrap_err panicked — the recurring CI flake that needed re-runs on #795/#798/#800/#801. Acquire env_test_lock at the top of the test so it serializes against the env-mutating tests. Test-only change. sentrix-core suite + clippy --all-targets -D warnings clean.
The durable fix for the recurring trie "missing node" stalls.
Root cause (confirmed)
The background prune clones the trie at a frozen version and walks the live-set on a thread while block apply keeps committing. A node committed / content-addressed-resurfaced during the multi-minute walk is absent from the frozen snapshot → deleted as an orphan → later
create_blocktraversal hits "missing node" → 20s propose timeout → chain crawl. Five partial fixes (#711/#714/#791/#798) narrowed but never closed the window — #798 deleted live nodes again in production (val3 @ h=6300000).A truly race-free background prune needs walk+delete in one MDBX RW txn (a chain-blocking write lock for the 10–20 min walk) or refcounting — both repeatedly deferred.
Fix — eliminate the concurrency (mechanism-agnostic)
SentrixTrie::prune_offline(keep)— same walk + keep-window asprune, minus the racy augment + generational defer, using immediategc_orphaned_nodes. Correct because it runs with no concurrent commits.sentrix chain prune [--keep N]— operator runs it during a maintenance halt (same model asreset-trie/verify-deep). Safe on a single peer: deleting unreachable nodes does not change the state_root → no fork risk (unlike reset-trie).SENTRIX_ENABLE_BACKGROUND_TRIE_PRUNE=1(legacySENTRIX_DISABLE_TRIE_PRUNE=1still force-disables).maybe_prune_trieearly-returns by default → the apply path no longer spawns the racy thread.Trade-off
Storage grows between maintenance prunes. Acceptable — correctness over convenience after five automatic-prune failures.
Tests
test_prune_offline_keeps_reachable_deletes_orphans— reclaims orphans AND the current value survives (the regression guard fix(trie): generational GC to end the prune/commit missing-node race #798 lacked; a live-node deletion would make the post-prunegetfail with "missing node").test_background_prune_enabled_env_var— off by default + opt-in/force-disable semantics.cargo test -p sentrix-trie: 84 passed;-p sentrix-core: 263 passed;cargo check --workspace -D warningsclean.Operator runbook
Background prune stays off. To reclaim trie storage: halt the validator →
sentrix chain prune→ restart. (Testnet already hasSENTRIX_DISABLE_TRIE_PRUNE=1; this makes off the code default.)Summary by CodeRabbit
Release Notes
New Features
sentrix chain prunecommand for offline trie storage optimization and maintenance. Configure the--keepoption to control how many recent committed states to preserve (default: 1000) before removing obsolete trie data.Chores