fix(trie): race-free offline prune + background prune off by default #800

Merged
github-actions[bot] merged 2 commits into main from fix/trie-offline-prune
Jun 6, 2026

@satyakwok satyakwok commented Jun 6, 2026

The durable fix for the recurring trie "missing node" stalls.

Root cause (confirmed)

The background prune clones the trie at a frozen version and walks the live-set on a thread while block apply keeps committing. A node committed / content-addressed-resurfaced during the multi-minute walk is absent from the frozen snapshot → deleted as an orphan → later create_block traversal hits "missing node" → 20s propose timeout → chain crawl. Five partial fixes (#711/#714/#791/#798) narrowed but never closed the window — #798 deleted live nodes again in production (val3 @ h=6300000).

A truly race-free background prune needs walk+delete in one MDBX RW txn (a chain-blocking write lock for the 10–20 min walk) or refcounting — both repeatedly deferred.
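
The interleaving described above can be replayed on a single thread. This is a minimal sketch over a toy content-addressed store, not the Sentrix trie: `Store`, `collect_reachable`, `gc_orphans`, and `racy_prune_demo` are hypothetical names chosen for illustration, and node keys are plain integers rather than hashes.

```rust
use std::collections::{HashMap, HashSet};

/// Toy content-addressed node store: node key -> child keys.
/// Hypothetical stand-in, not the real Sentrix types.
type Store = HashMap<u64, Vec<u64>>;

/// Collect every key reachable from `root`; the error names the first absent node.
fn collect_reachable(store: &Store, root: u64) -> Result<HashSet<u64>, String> {
    let mut seen = HashSet::new();
    let mut stack = vec![root];
    while let Some(key) = stack.pop() {
        if !seen.insert(key) {
            continue; // already visited
        }
        let children = store
            .get(&key)
            .ok_or_else(|| format!("missing node {key}"))?;
        stack.extend(children.iter().copied());
    }
    Ok(seen)
}

/// Delete every key that is not in `live`; returns how many were removed.
fn gc_orphans(store: &mut Store, live: &HashSet<u64>) -> usize {
    let before = store.len();
    store.retain(|key, _| live.contains(key));
    before - store.len()
}

/// Replays the racy order of events and returns the traversal result for the new root.
fn racy_prune_demo() -> Result<HashSet<u64>, String> {
    let mut store: Store = HashMap::from([(1, vec![2]), (2, vec![])]);
    // 1. The background prune walks the live set at the frozen root 1.
    let frozen_live = collect_reachable(&store, 1)?;
    // 2. Block apply commits a new root 3 (with a new node 4) while the walk runs.
    store.insert(4, vec![]);
    store.insert(3, vec![2, 4]);
    // 3. The prune deletes everything absent from the frozen snapshot, including 3 and 4.
    gc_orphans(&mut store, &frozen_live);
    // 4. The next traversal from the new root hits the deleted node.
    collect_reachable(&store, 3)
}
```

In this toy model `racy_prune_demo()` returns `Err("missing node 3")`. Moving step 1 after step 2, which is what running with no concurrent commits amounts to, makes the same traversal succeed.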

Fix — eliminate the concurrency (mechanism-agnostic)

  • SentrixTrie::prune_offline(keep) — same walk + keep-window as prune, minus the racy augment + generational defer, using immediate gc_orphaned_nodes. Correct because it runs with no concurrent commits.
  • sentrix chain prune [--keep N] — operator runs it during a maintenance halt (same model as reset-trie/verify-deep). Safe on a single peer: deleting unreachable nodes does not change the state_root → no fork risk (unlike reset-trie).
  • Background prune OFF by default — runs only with SENTRIX_ENABLE_BACKGROUND_TRIE_PRUNE=1 (legacy SENTRIX_DISABLE_TRIE_PRUNE=1 still force-disables). maybe_prune_trie early-returns by default → the apply path no longer spawns the racy thread.
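
The gate in the last bullet can be sketched as a pure decision function. This is an illustration under assumptions, not the code in this PR: `background_prune_enabled_from` is a hypothetical helper, and the strict `"1"` comparison is assumed from the semantics attributed to #633.

```rust
/// Pure decision function. `enable` and `disable` are the raw values of
/// SENTRIX_ENABLE_BACKGROUND_TRIE_PRUNE and SENTRIX_DISABLE_TRIE_PRUNE, if set.
/// Assumption: only the exact string "1" counts as set.
fn background_prune_enabled_from(enable: Option<&str>, disable: Option<&str>) -> bool {
    let opted_in = enable == Some("1");
    let force_disabled = disable == Some("1");
    // Off by default; the legacy disable flag wins over the opt-in.
    opted_in && !force_disabled
}

/// Reads the two variables from the process environment and applies the rule above.
fn background_prune_enabled() -> bool {
    let enable = std::env::var("SENTRIX_ENABLE_BACKGROUND_TRIE_PRUNE").ok();
    let disable = std::env::var("SENTRIX_DISABLE_TRIE_PRUNE").ok();
    background_prune_enabled_from(enable.as_deref(), disable.as_deref())
}
```

Keeping the decision separate from the environment read lets a test cover every combination without mutating process-wide state.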

Trade-off

Storage grows between maintenance prunes. Acceptable — correctness over convenience after five automatic-prune failures.

Tests

  • test_prune_offline_keeps_reachable_deletes_orphans: reclaims orphans AND the current value survives (the regression guard that #798, "fix(trie): generational GC to end the prune/commit missing-node race", lacked; a live-node deletion would make the post-prune get fail with "missing node").
  • test_background_prune_enabled_env_var — off by default + opt-in/force-disable semantics.
  • cargo test -p sentrix-trie: 84 passed; -p sentrix-core: 263 passed; cargo check --workspace -D warnings clean.
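
The invariant the first test guards (orphans are reclaimed, everything reachable from the kept roots survives) can be shown on a toy versioned trie. This is a sketch only: `ToyTrie`, its fields, and this `prune_offline` are hypothetical and do not mirror `SentrixTrie`. Treating `keep = 0` as "keep the current root" is an assumption.

```rust
use std::collections::{HashMap, HashSet};

/// Toy versioned trie, a hypothetical stand-in for illustration.
struct ToyTrie {
    nodes: HashMap<u64, Vec<u64>>, // node key -> child keys
    roots: Vec<u64>,               // committed roots, oldest first
}

impl ToyTrie {
    /// Add everything reachable from `root` to `live`; error on the first absent node.
    fn walk(&self, root: u64, live: &mut HashSet<u64>) -> Result<(), String> {
        let mut stack = vec![root];
        while let Some(key) = stack.pop() {
            if !live.insert(key) {
                continue; // already visited
            }
            let children = self
                .nodes
                .get(&key)
                .ok_or_else(|| format!("missing node {key}"))?;
            stack.extend(children.iter().copied());
        }
        Ok(())
    }

    /// Keep the newest `keep` roots (at least the current one) and delete every
    /// node unreachable from them. Safe only with no concurrent commits.
    fn prune_offline(&mut self, keep: usize) -> Result<usize, String> {
        let cut = self.roots.len().saturating_sub(keep.max(1));
        let kept = self.roots.split_off(cut);
        self.roots = kept;
        let mut live = HashSet::new();
        for &root in &self.roots {
            self.walk(root, &mut live)?;
        }
        let before = self.nodes.len();
        self.nodes.retain(|key, _| live.contains(key));
        Ok(before - self.nodes.len())
    }
}
```

A regression guard in this style asserts both halves: the returned count is non-zero, and a walk from the current root still succeeds after the prune.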

Operator runbook

Background prune stays off. To reclaim trie storage: halt the validator → sentrix chain prune → restart. (Testnet already has SENTRIX_DISABLE_TRIE_PRUNE=1; this makes off the code default.)

Summary by CodeRabbit

Release Notes

  • New Features

    • Added sentrix chain prune command for offline trie storage optimization and maintenance. Configure the --keep option to control how many recent committed states to preserve (default: 1000) before removing obsolete trie data.
  • Chores

    • Version bumped to 2.2.40.

The durable fix for the recurring trie "missing node" stalls. Root cause
(confirmed): the BACKGROUND prune runs on a thread cloning the trie at a
frozen version while block apply keeps committing; a node committed or
content-addressed-resurfaced during the multi-minute live-set walk is absent
from the frozen snapshot and gets deleted as an orphan. Five partial fixes
(#711 reload-before-gc, #714 collect_reachable depth, #791 split passes,
#798 generational defer) each narrowed but never closed the window — #798
deleted live nodes again in production (val3, h=6300000 prune → "missing
node" → propose stall). A truly race-free background prune needs walk+delete
in one MDBX RW txn (a chain-blocking write lock for the 10-20 min walk) or
refcounting — both deferred.

This takes the mechanism-agnostic safe route: eliminate the concurrency.

- `SentrixTrie::prune_offline(keep)` — same walk + keep-window as `prune`,
  minus the racy augment and the generational deferral, using the combined
  immediate `gc_orphaned_nodes`. Correct only with no concurrent commits,
  which is guaranteed by running it on a STOPPED node.
- `sentrix chain prune [--keep N]` — operator runs it during a maintenance
  halt (same model as `chain reset-trie` / `verify-deep`). Safe on a single
  peer: deleting unreachable nodes does not change the state_root (which only
  commits reachable nodes), so no fork risk — unlike reset-trie.
- Background prune is now OFF by default. It only runs with
  SENTRIX_ENABLE_BACKGROUND_TRIE_PRUNE=1 (and the legacy
  SENTRIX_DISABLE_TRIE_PRUNE=1 still force-disables). maybe_prune_trie
  early-returns by default, so the apply path no longer even spawns the
  racy thread.

Trade-off: storage grows between maintenance prunes. Acceptable; correctness
over convenience after five automatic-prune failures.

Tests: prune_offline reclaims orphans AND the current value survives the
prune (the regression guard #798 lacked — a live-node deletion would make
the post-prune `get` fail with "missing node"); background_prune_enabled
gate is off by default + opt-in semantics. cargo test -p sentrix-trie: 84
passed; -p sentrix-core: 263 passed; cargo check --workspace -D warnings clean.
@github-actions github-actions Bot enabled auto-merge (squash) June 6, 2026 10:20
@codecov

codecov Bot commented Jun 6, 2026

Codecov Report

❌ Patch coverage is 88.09524% with 5 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| crates/sentrix-trie/src/tree.rs | 88.09% | 5 Missing ⚠️ |

@coderabbitai

coderabbitai Bot commented Jun 6, 2026


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 5a2035a2-0eae-4ccd-852b-cc903a5f06cc

📥 Commits

Reviewing files that changed from the base of the PR and between fd0c43a and cfb5d34.

📒 Files selected for processing (3)
  • bin/sentrix/src/commands/chain.rs
  • crates/sentrix-core/src/blockchain_trie_ops.rs
  • crates/sentrix-core/src/storage.rs
📝 Walkthrough

This PR adds an offline trie garbage collection feature to prune unreachable nodes from long-running blockchain state, along with environment-variable gating to control background pruning behavior. A new SentrixTrie::prune_offline(keep_versions) method walks reachability from the current root and immediately removes orphaned nodes. Background pruning is now off by default and requires explicit SENTRIX_ENABLE_BACKGROUND_TRIE_PRUNE=1 to enable. The capability is exposed to operators via a new sentrix chain prune --keep N CLI subcommand that invokes the offline collection. Version is bumped to 2.2.40.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • sentrix-labs/sentrix#633: Updates environment-variable gating for trie pruning to use strict "1" semantics, aligning with the background-prune-enabled logic introduced here.
  • sentrix-labs/sentrix#584: Introduces the background-thread trie-pruning mechanism that this PR gates behind the new SENTRIX_ENABLE_BACKGROUND_TRIE_PRUNE environment variable.
🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: adding a race-free offline prune and disabling background prune by default. |
| Description check | ✅ Passed | The PR description comprehensively covers the root cause, fix strategy, trade-offs, tests, and operator runbook, addressing all critical aspects. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@bin/sentrix/src/commands/chain.rs`:
- Around line 288-290: The handler currently only prints a warning and then
calls trie.prune_offline(keep); change it to fail closed by acquiring an
inter-process exclusivity/maintenance lock before calling trie.prune_offline:
attempt to open the MDBX environment or the chain DB path in a way that fails if
another writer is active (or create and lock a dedicated maintenance lock file
with an exclusive flock), check that lock acquisition succeeded, and return an
error (abort) if it did not; reference trie.prune_offline and the command
handler around where the two println! calls and trie.prune_offline(keep) are
invoked so the prune only runs when the exclusive lock is held.
- Around line 279-286: The pruning command currently calls bc.init_trie(...)
which can rebuild the trie from AccountDB; instead, refuse to proceed if there
is no already-persisted trie/root. Remove or skip the call to
bc.init_trie(Arc::clone(&mdbx)) in cmd_chain_prune (or the function handling
prune), and add an explicit precondition: if bc.height() > 0 and
bc.state_trie.as_ref().is_none() then return an error (e.g. anyhow!("persisted
trie/root missing; refuse to prune") ). Keep allowing operations when height ==
0 as before, but do not trigger any rebuild-on-open behavior here. Ensure the
check references bc.init_trie, bc.height, and bc.state_trie to locate the
change.
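
One way to read the first finding's "fail closed" suggestion is a lock file created exclusively before the prune runs. The sketch below uses `create_new` (atomic create-if-absent) rather than the `flock` or MDBX-level lock the comment mentions, so it is a different technique with a known weakness: a crash leaves a stale file behind. `MaintenanceLock` and `prune_with_lock` are hypothetical names, and the lock only excludes processes that also take it.

```rust
use std::fs::{self, File, OpenOptions};
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical maintenance lock: an exclusively created file, removed on drop.
struct MaintenanceLock {
    path: PathBuf,
    _file: File,
}

impl MaintenanceLock {
    /// Fails with `ErrorKind::AlreadyExists` if another holder created the file first.
    fn acquire(path: &Path) -> io::Result<Self> {
        let file = OpenOptions::new()
            .write(true)
            .create_new(true) // atomic: errors instead of reusing an existing file
            .open(path)?;
        Ok(Self { path: path.to_path_buf(), _file: file })
    }
}

impl Drop for MaintenanceLock {
    fn drop(&mut self) {
        // Best effort: a failure here leaves a stale lock for the operator to clear.
        let _ = fs::remove_file(&self.path);
    }
}

/// Runs `prune` only while the lock is held; aborts before pruning otherwise.
fn prune_with_lock<F>(lock_path: &Path, prune: F) -> io::Result<usize>
where
    F: FnOnce() -> io::Result<usize>,
{
    let _guard = MaintenanceLock::acquire(lock_path)?;
    prune()
}
```

The closure stands in for the call to `trie.prune_offline(keep)`. The error from `acquire` propagates, so the prune never starts without the lock.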

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro Plus

Run ID: 04bbd063-c2c7-44c0-a5be-5efe8172c4e5

📥 Commits

Reviewing files that changed from the base of the PR and between 1550e9c and fd0c43a.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock, !**/*.lock
📒 Files selected for processing (6)
  • Cargo.toml
  • bin/sentrix/src/commands/chain.rs
  • bin/sentrix/src/main.rs
  • crates/sentrix-core/src/blockchain.rs
  • crates/sentrix-core/src/blockchain_trie_ops.rs
  • crates/sentrix-trie/src/tree.rs

Comment thread bin/sentrix/src/commands/chain.rs Outdated
Comment thread bin/sentrix/src/commands/chain.rs
@satyakwok satyakwok self-assigned this Jun 6, 2026
…odeRabbit)

- Clippy: background_prune_enabled uses is_none_or instead of !..is_some_and
  (nonminimal_bool, -D warnings in CI; cargo check didn't catch it).
- chain prune Major guards (CodeRabbit on #800):
  - Refuse if no persisted trie root at the current height — never let the
    maintenance prune trigger an init_trie backfill rebuild (different node
    shape = reset-trie fork class). New Storage::has_persisted_trie_root.
  - Enforce the offline precondition instead of only printing it: detect a
    running node via height-stability (sample across >5s poll-persist
    interval) and fail closed if the chain advanced; override with
    SENTRIX_ALLOW_ONLINE_PRUNE=1 for rare recovery.
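
The two guards in this commit message can be sketched as one precondition check. This is an illustration under assumptions, not the merged code: `check_prune_preconditions` and `PruneGuardError` are hypothetical, the height source is passed in as a closure, and the override is assumed to skip only the height-stability check, not the persisted-root check.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical error type for the two refusals.
#[derive(Debug, PartialEq)]
enum PruneGuardError {
    MissingPersistedRoot,
    ChainAdvanced { before: u64, after: u64 },
}

/// Refuse to prune unless a persisted root exists and the height is stable
/// across `sample_interval`. `allow_online` models SENTRIX_ALLOW_ONLINE_PRUNE=1.
fn check_prune_preconditions(
    has_persisted_root: bool,
    mut read_height: impl FnMut() -> u64,
    sample_interval: Duration,
    allow_online: bool,
) -> Result<(), PruneGuardError> {
    if !has_persisted_root {
        // Never fall through to a rebuild; a missing root is a hard stop.
        return Err(PruneGuardError::MissingPersistedRoot);
    }
    if allow_online {
        return Ok(()); // operator override for rare recovery
    }
    let before = read_height();
    thread::sleep(sample_interval);
    let after = read_height();
    if after != before {
        // The chain moved while we watched, so a node is still running.
        return Err(PruneGuardError::ChainAdvanced { before, after });
    }
    Ok(())
}
```

A real caller would pass an interval longer than the node's poll-persist period (the message cites more than 5s), so that a running node has time to advance the stored height between the two samples.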
@github-actions github-actions Bot merged commit 71a080b into main Jun 6, 2026
19 checks passed
@satyakwok satyakwok deleted the fix/trie-offline-prune branch June 6, 2026 11:12
github-actions Bot pushed a commit that referenced this pull request Jun 6, 2026
…802)

test_c03_pass2_failure_rolls_back_state credits v1 to u64::MAX - reward and
relies on the coinbase credit overflowing in Pass 2. It read get_block_reward()
(which depends on the reward-fork env vars: VOYAGER_REWARD_V2_HEIGHT /
TOKENOMICS_V2_HEIGHT / halving) WITHOUT holding env_test_lock, while sibling
tests set those vars under that lock. Under the parallel `cargo test` run (e.g.
the report-only coverage job) a concurrent mutation changed `reward` mid-test,
the overflow didn't trigger, add_block returned Ok, and unwrap_err panicked —
the recurring CI flake that needed re-runs on #795/#798/#800/#801.

Acquire env_test_lock at the top of the test so it serializes against the
env-mutating tests. Test-only change. sentrix-core suite + clippy --all-targets
-D warnings clean.
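
The serialization pattern this fix relies on can be sketched with a process-wide mutex. In this illustration an atomic counter stands in for the reward-fork env vars, to avoid mutating the real environment. `ENV_TEST_LOCK`, `REWARD`, and `set_then_read_reward` are hypothetical names, and `env_test_lock` here is a guess at the helper's shape rather than the one in sentrix-core.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Mutex, MutexGuard};
use std::thread;

/// Process-wide lock shared by every test that touches the shared setting.
static ENV_TEST_LOCK: Mutex<()> = Mutex::new(());
/// Stand-in for the env-derived block reward (hypothetical).
static REWARD: AtomicU64 = AtomicU64::new(1);

/// Take the lock, recovering from poisoning so one panicking test does not
/// cascade into every later test.
fn env_test_lock() -> MutexGuard<'static, ()> {
    ENV_TEST_LOCK
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Mutate the shared setting and read it back while holding the lock.
/// Without the guard, a concurrent writer could change the value in between.
fn set_then_read_reward(value: u64) -> u64 {
    let _guard = env_test_lock();
    REWARD.store(value, Ordering::SeqCst);
    thread::yield_now(); // widen the window a racing writer would need
    REWARD.load(Ordering::SeqCst)
}
```

The fix described above is the reader-side half of this pattern: the test that only reads the reward must also hold the guard, otherwise the writers' lock protects nothing.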