fix(btcio): recover stranded canonical L1 height on restart #1944
voidash wants to merge 3 commits into
Add a debug bail point immediately after btcio writes an L1 canonical-chain entry and before it notifies the ASM worker, plus a functional test that crashes there, restarts, and asserts the height's ASM manifest is regenerated. Without the follow-up fix this test fails: restart computes the reader target from the canonical tip, so the stranded height is skipped and its manifest is never produced (all_manifests_present=false).
btcio writes an L1 canonical-chain entry before notifying the ASM worker, and the reader awaits ASM completion per block. A crash in that window leaves the canonical tip without an ASM anchor state, and restart resumes from canonical_tip + 1, so the stranded height's manifest and anchor state are never regenerated. On reader init, walk down from the canonical tip (bounded by the reorg lookback window) to the highest height with a stored ASM anchor state - the last fully materialized height, since the ASM worker writes anchor state last - and revert the canonical chain to it. The reader then re-fetches and re-submits the stranded heights and the ASM worker rebuilds the missing manifests and anchor states cleanly.
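The walk-down described in this commit can be sketched roughly as follows. This is a minimal, in-memory illustration and not the code in this PR: `MockStore`, `last_materialized_height`, and `recover_stranded_tip` are hypothetical names, and the real canonical-chain and ASM state stores are async and persistent.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical in-memory stand-in for the L1 canonical-chain table and the
/// ASM anchor-state store. The real storage APIs differ.
#[derive(Debug, Default)]
pub struct MockStore {
    /// height -> block id of the canonical entry at that height
    pub canonical: BTreeMap<u64, [u8; 32]>,
    /// (height, block id) pairs that have a stored ASM anchor state
    pub anchor_states: BTreeSet<(u64, [u8; 32])>,
}

impl MockStore {
    /// Highest height with a canonical entry, if any.
    pub fn canonical_tip(&self) -> Option<u64> {
        self.canonical.keys().next_back().copied()
    }

    /// Walk down from the canonical tip, at most `lookback` heights, and
    /// return the highest height whose canonical block has an anchor state.
    pub fn last_materialized_height(&self, lookback: u64) -> Option<u64> {
        let tip = self.canonical_tip()?;
        let floor = tip.saturating_sub(lookback);
        (floor..=tip).rev().find(|h| {
            self.canonical
                .get(h)
                .is_some_and(|id| self.anchor_states.contains(&(*h, *id)))
        })
    }

    /// Revert canonical entries above the last materialized height and
    /// return the height the reader should fetch next. Returns `None`, and
    /// leaves the chain untouched, when nothing inside the window is
    /// materialized.
    pub fn recover_stranded_tip(&mut self, lookback: u64) -> Option<u64> {
        let good = self.last_materialized_height(lookback)?;
        // `split_off` keeps keys < good + 1 in `self.canonical`.
        let _stranded = self.canonical.split_off(&(good + 1));
        Some(good + 1)
    }
}
```

With canonical entries at heights 100 to 103 and anchor states only at 100 and 101, `recover_stranded_tip(10)` drops 102 and 103 and returns 102 as the next height to fetch. Bounding the walk by the lookback window keeps the sketch from reverting arbitrarily deep if anchor states are missing for some unrelated reason.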
Commit: ef73e6f SP1 Execution Results
Codecov Report❌ Patch coverage is
@@ Coverage Diff @@
## main #1944 +/- ##
========================================
Coverage 84.42% 84.42%
========================================
Files 637 637
Lines 76805 76701 -104
========================================
- Hits 64840 64755 -85
+ Misses 11965 11946 -19
... and 61 files with indirect coverage changes
Pushed Local verification from the PR worktree:
The previous lockfile blocker is fixed. The latest red CI jobs are setup/download or follow-up CI failures, so I reran the failed Shellcheck, Lint, prover guest, and Tests workflows. The PR is still a draft.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fada87c860
};

let block = L1BlockCommitment::new(height, blockid);
if ctx.storage.asm().get_state_async(block).await?.is_some() {
Avoid replaying blocks after partial ASM writes
When the process dies during ASM materialization after AsmWorkerCtx::append_manifest_to_mmr has committed but before store_anchor_state runs, this check treats the height as unmaterialized and reverts only the L1 canonical entries. The replay then drives the same block through ASM again, but the manifest MMR append is not idempotent (append_manifest_to_mmr uses append_leaf_blocking), so the restart path can append a duplicate leaf and leave the persisted MMR inconsistent with the recovered anchor state. Please either detect/roll back the other ASM side effects or make the manifest-MMR write idempotent before using anchor-state absence as the sole recovery marker.
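One way to picture the idempotent-write option is an append that takes the expected leaf index and turns a replay of the same leaf into a no-op. The sketch below is only an illustration under that assumption: `LeafLog` and `append_at` are hypothetical, a plain `Vec` stands in for the manifest MMR, and how the caller derives the expected index for a block is not shown in this PR.

```rust
/// Hypothetical append-only leaf log standing in for the manifest MMR.
#[derive(Debug, Default)]
pub struct LeafLog {
    leaves: Vec<[u8; 32]>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AppendOutcome {
    /// The leaf was new and has been appended.
    Appended,
    /// The same leaf was already stored at this index; nothing was written.
    AlreadyPresent,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AppendError {
    /// The index is past the end of the log, so earlier leaves are missing.
    Gap { index: u64, len: u64 },
    /// A different leaf is already stored at this index.
    Conflict { index: u64 },
}

impl LeafLog {
    /// Number of leaves currently stored.
    pub fn leaf_count(&self) -> u64 {
        self.leaves.len() as u64
    }

    /// Append `leaf` at `index`. Replaying an identical append is a no-op,
    /// so a restart that re-drives the same block cannot add a duplicate.
    pub fn append_at(
        &mut self,
        index: u64,
        leaf: [u8; 32],
    ) -> Result<AppendOutcome, AppendError> {
        let len = self.leaf_count();
        if index == len {
            self.leaves.push(leaf);
            Ok(AppendOutcome::Appended)
        } else if index < len {
            if self.leaves[index as usize] == leaf {
                Ok(AppendOutcome::AlreadyPresent)
            } else {
                Err(AppendError::Conflict { index })
            }
        } else {
            Err(AppendError::Gap { index, len })
        }
    }
}
```

The alternative the review mentions, detecting and rolling back the partial ASM side effects before replay, would leave the append itself unchanged; this sketch only covers the idempotent-write route.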
Problem
btcio writes the L1 canonical-chain entry before notifying the ASM worker. If the
process crashes in that window, restart resumes from canonical_tip + 1, so the
stranded tip height has a canonical entry but no ASM anchor state/manifest.
The ASM worker does eventually self-heal this: when the next L1 block arrives it
walks back over Bitcoin header prev_blockhash (independent of the canonical DB)
to the last height with a stored anchor state and replays the gap, re-driving the
stranded height. So the stranding is transient, not permanent, but it only
clears once a new Bitcoin block is mined and submitted. Until then the canonical
tip stays unmaterialized: ~10 min on mainnet, and indefinitely on a quiet chain or
a restarted node that receives no further blocks.
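The existing self-heal walk can be pictured with a small sketch. This is an assumption-laden illustration, not the ASM worker's code: `replay_path` is a hypothetical helper, `prev_of` stands in for a Bitcoin header lookup, and `has_anchor` stands in for the anchor-state store.

```rust
use std::collections::{HashMap, HashSet};

type BlockHash = [u8; 32];

/// Returns the blocks to replay, oldest first, ending with `new_block`.
///
/// `prev_of` maps a block hash to its header's prev_blockhash and
/// `has_anchor` holds the blocks with a stored anchor state (both are
/// hypothetical stand-ins). Returns `None` if the walk runs off the known
/// headers or needs more than `max_depth` blocks before meeting an anchor.
pub fn replay_path(
    new_block: BlockHash,
    prev_of: &HashMap<BlockHash, BlockHash>,
    has_anchor: &HashSet<BlockHash>,
    max_depth: usize,
) -> Option<Vec<BlockHash>> {
    let mut path = Vec::new();
    let mut cur = new_block;
    // Follow prev_blockhash links until a materialized block is found.
    while !has_anchor.contains(&cur) {
        if path.len() >= max_depth {
            return None;
        }
        path.push(cur);
        cur = *prev_of.get(&cur)?;
    }
    // The walk collected newest first; replay must run oldest first.
    path.reverse();
    Some(path)
}
```

For a chain a <- b <- c where only a has an anchor state, a new block c yields the path [b, c], so the stranded block b is re-driven first. Nothing in this walk runs until some `new_block` arrives, which is the gap this PR closes on restart.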
Tests
test_l1_canonical_write_before_manifest_crash + debug bail tag
btcio_after_l1_canonical_write. The test asserts the stranded tip is
materialized after restart, before mining any further block, which is exactly
the window this change closes (without it the tip would wait for the next block).