Skip to content

fix(btcio): recover stranded canonical L1 height on restart#1944

Open
voidash wants to merge 3 commits into
mainfrom
fix/l1-stranded-canonical-height
Open

fix(btcio): recover stranded canonical L1 height on restart#1944
voidash wants to merge 3 commits into
mainfrom
fix/l1-stranded-canonical-height

Conversation

@voidash
Copy link
Copy Markdown
Contributor

@voidash voidash commented Jun 5, 2026

Problem

btcio writes the L1 canonical-chain entry before notifying the ASM worker. If the
process crashes in that window, restart resumes from canonical_tip + 1, so the
stranded tip height has a canonical entry but no ASM anchor state/manifest.

The ASM worker does eventually self-heal this: when the next L1 block arrives it
walks back over Bitcoin header prev_blockhash (independent of the canonical DB)
to the last height with a stored anchor state and replays the gap, re-driving the
stranded height. So the stranding is transient, not permanent — but it only
clears once a new Bitcoin block is mined and submitted. Until then the canonical
tip stays unmaterialized: ~10 min on mainnet, and indefinitely on a quiet chain or
a restarted node that receives no further blocks.

Tests

  • New test_l1_canonical_write_before_manifest_crash + debug bail tag
    btcio_after_l1_canonical_write. The test asserts the stranded tip is
    materialized after restart, before mining any further block, which is exactly
    the window this change closes (without it the tip would wait for the next block).

voidash added 2 commits June 5, 2026 11:43
Add a debug bail point immediately after btcio writes an L1
canonical-chain entry and before it notifies the ASM worker, plus a
functional test that crashes there, restarts, and asserts the height's
ASM manifest is regenerated.

Without the follow-up fix this test fails: restart computes the reader
target from the canonical tip, so the stranded height is skipped and its
manifest is never produced (all_manifests_present=false).
btcio writes an L1 canonical-chain entry before notifying the ASM
worker, and the reader awaits ASM completion per block. A crash in that
window leaves the canonical tip without an ASM anchor state, and restart
resumes from canonical_tip + 1, so the stranded height's manifest and
anchor state are never regenerated.

On reader init, walk down from the canonical tip (bounded by the reorg
lookback window) to the highest height with a stored ASM anchor state -
the last fully materialized height, since the ASM worker writes anchor
state last - and revert the canonical chain to it. The reader then
re-fetches and re-submits the stranded heights and the ASM worker
rebuilds the missing manifests and anchor states cleanly.
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

Commit: ef73e6f

SP1 Execution Results

program cycles gas
EVM EE Chunk 824,732 969,395
EVM EE Account 404,056 498,593
Checkpoint 2,601,525 3,007,269

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 75.75758% with 8 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.42%. Comparing base (b5d34b8) to head (fada87c).
⚠️ Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
crates/btcio/src/reader/query.rs 72.41% 8 Missing ⚠️
@@           Coverage Diff            @@
##             main    #1944    +/-   ##
========================================
  Coverage   84.42%   84.42%            
========================================
  Files         637      637            
  Lines       76805    76701   -104     
========================================
- Hits        64840    64755    -85     
+ Misses      11965    11946    -19     
Flag Coverage Δ
functional 66.34% <75.75%> (-0.17%) ⬇️
unit 69.73% <0.00%> (+0.21%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
crates/btcio/src/reader/handler.rs 95.45% <100.00%> (+0.10%) ⬆️
crates/storage/src/managers/asm.rs 100.00% <100.00%> (ø)
crates/btcio/src/reader/query.rs 82.40% <72.41%> (-1.58%) ⬇️

... and 61 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@voidash
Copy link
Copy Markdown
Contributor Author

voidash commented Jun 8, 2026

Pushed fada87c8606e with the minimal Cargo.lock update for the new strata-btcio -> strata-common dependency. The lockfile diff is one dependency entry only.

Local verification from the PR worktree:

  • cargo metadata --locked --format-version 1 --no-deps
  • cargo check -p strata-btcio --locked

The previous lockfile blocker is fixed. The latest red CI jobs are setup/download or follow-up CI failures, so I reran the failed Shellcheck, Lint, prover guest, and Tests workflows. The PR is still draft.

@voidash voidash marked this pull request as ready for review June 8, 2026 08:36
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fada87c860

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

};

let block = L1BlockCommitment::new(height, blockid);
if ctx.storage.asm().get_state_async(block).await?.is_some() {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid replaying blocks after partial ASM writes

When the process dies during ASM materialization after AsmWorkerCtx::append_manifest_to_mmr has committed but before store_anchor_state runs, this check treats the height as unmaterialized and reverts only the L1 canonical entries. The replay then drives the same block through ASM again, but the manifest MMR append is not idempotent (append_manifest_to_mmr uses append_leaf_blocking), so the restart path can append a duplicate leaf and leave the persisted MMR inconsistent with the recovered anchor state. Please either detect/roll back the other ASM side effects or make the manifest-MMR write idempotent before using anchor-state absence as the sole recovery marker.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant