-
Notifications
You must be signed in to change notification settings - Fork 0
fix(asm-runner): catch up to bitcoind tip on restart #105
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -8,9 +8,9 @@ | |
| //! real-time block notification with `bury_depth=0` (no reorg tracking, no | ||
| //! tx monitoring). Written to avoid a painful dependency on `strata-bridge`. | ||
|
|
||
| use std::{sync::Arc, time::Duration}; | ||
| use std::{ops::RangeInclusive, sync::Arc, time::Duration}; | ||
|
|
||
| use anyhow::{Context, Result, bail}; | ||
| use anyhow::{Context, Result}; | ||
| use bitcoin::Block; | ||
| use bitcoincore_zmq::{Message, SocketMessage, subscribe_async_wait_handshake}; | ||
| use bitcoind_async_client::{Client, traits::Reader}; | ||
|
|
@@ -54,6 +54,19 @@ pub(crate) async fn drive_asm_from_bitcoin( | |
| let mut stream = stream; | ||
| let mut cursor = start_height; | ||
|
|
||
| // ZMQ only delivers blocks mined after we subscribe, so any heights between | ||
| // `cursor` and the current tip — typically blocks mined while we were down — | ||
| // would otherwise wait forever for a fresh ZMQ event that may never come. | ||
| // Poll the tip via RPC and fill the gap before entering the wait loop. | ||
| let tip_height = bitcoin_client | ||
| .get_block_count() | ||
| .await | ||
| .context("failed to query bitcoind tip for startup catchup")?; | ||
| if tip_height >= cursor { | ||
| backfill_range(&bitcoin_client, &asm_worker, &proof_tx, cursor..=tip_height).await?; | ||
| cursor = tip_height + 1; | ||
|
Comment on lines
+65
to
+67
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
When Useful? React with 👍 / 👎. |
||
| } | ||
|
|
||
| loop { | ||
| let msg = tokio::select! { | ||
| _ = shutdown.wait_for_shutdown() => { | ||
|
|
@@ -94,30 +107,17 @@ pub(crate) async fn drive_asm_from_bitcoin( | |
| continue; | ||
| } | ||
|
|
||
| // Backfill any skipped heights [cursor, received_height). This covers | ||
| // the common case of starting after a downtime, or rare ZMQ drops. | ||
| // Backfill any skipped heights [cursor, received_height - 1]. Covers | ||
| // rare ZMQ drops between events; the startup-tip catchup above handles | ||
| // the restart case. | ||
| if received_height > cursor { | ||
| info!( | ||
| from = %cursor, | ||
| to = %received_height, | ||
| "backfilling skipped blocks" | ||
| ); | ||
| for height in cursor..received_height { | ||
| match fetch_block_at_height(&bitcoin_client, height).await { | ||
| Ok(fetched) => { | ||
| if let Err(err) = submit_block(&asm_worker, &proof_tx, fetched).await { | ||
| error!(%height, ?err, "failed to submit backfill block"); | ||
| // Stop backfilling on failure so we don't hand the | ||
| // worker a gap. The next ZMQ event will retry. | ||
| bail!("backfill interrupted at height {height}: {err}"); | ||
| } | ||
| } | ||
| Err(err) => { | ||
| error!(%height, ?err, "failed to fetch backfill block"); | ||
| bail!("backfill fetch failed at height {height}: {err}"); | ||
| } | ||
| } | ||
| } | ||
| backfill_range( | ||
| &bitcoin_client, | ||
| &asm_worker, | ||
| &proof_tx, | ||
| cursor..=received_height - 1, | ||
| ) | ||
|
Comment on lines
+114
to
+119
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This duplication hints at a deeper issue. Right now, the block watcher muddies the boundary between the bitcoin block fetcher and block submitter. A cleaner way to do this is to separate the two. If you look at the This also means that you don't need a brittle functional test. You can just add a proptest that asserts that block production is contiguous. Downstream consumers (in this case the asm worker) can just depend on the fact that provided a start height, it will always receive blocks in the order of block heights without any discontinuity. |
||
| .await?; | ||
| } | ||
|
|
||
| if let Err(err) = submit_block(&asm_worker, &proof_tx, block).await { | ||
|
|
@@ -127,6 +127,26 @@ pub(crate) async fn drive_asm_from_bitcoin( | |
| } | ||
| } | ||
|
|
||
| /// Fetch and submit every block in `range` via RPC, bailing on the first | ||
| /// failure so we never hand the worker a gap. | ||
| async fn backfill_range( | ||
| client: &Client, | ||
| asm_worker: &AsmWorkerHandle, | ||
| proof_tx: &Option<mpsc::UnboundedSender<ProofId>>, | ||
| range: RangeInclusive<u64>, | ||
| ) -> Result<()> { | ||
| info!(from = %range.start(), to = %range.end(), "backfilling skipped blocks"); | ||
| for height in range { | ||
| let block = fetch_block_at_height(client, height) | ||
| .await | ||
| .with_context(|| format!("backfill fetch failed at height {height}"))?; | ||
| submit_block(asm_worker, proof_tx, block) | ||
| .await | ||
| .with_context(|| format!("backfill submit failed at height {height}"))?; | ||
| } | ||
|
Comment on lines
+139
to
+146
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Since this just depends on the block height, will there be issues if there is a fork around the current tip? |
||
| Ok(()) | ||
| } | ||
|
|
||
| /// Fetch a single block by height via the bitcoind RPC client. | ||
| async fn fetch_block_at_height(client: &Client, height: u64) -> Result<Block> { | ||
| let hash = client | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,107 @@ | ||
| import logging | ||
| import os | ||
|
|
||
| import flexitest | ||
|
|
||
| from utils.utils import ( | ||
| wait_until_asm_reaches_height, | ||
| wait_until_asm_ready, | ||
| wait_until_bitcoind_ready, | ||
| ) | ||
|
|
||
| # Emitted by the worker only on first-bootstrap, gated on | ||
| # `!ctx.has_l1_manifest(pivot_block.blkid())` — so its absence in a post-restart | ||
| # log slice is direct evidence that the runner resumed from persisted state | ||
| # rather than replaying from genesis. See crates/worker/src/service.rs. | ||
| GENESIS_BOOTSTRAP_MARKER = "Created genesis manifest" | ||
|
|
||
|
|
||
| @flexitest.register | ||
| class AsmRestartTest(flexitest.Test): | ||
| """End-to-end coverage of the runner's restart path. | ||
|
|
||
| Persistence belongs at the binary level — the worker reloads from sled, | ||
| resumes from the last persisted block, and reconnects to bitcoind. Unit | ||
| tests in `asm-storage` only exercise sled's own durability, which sled | ||
| already covers. | ||
|
|
||
| A naive "state at height H matches" assertion would also hold for a fresh | ||
| runner that just replayed the same chain from genesis. To distinguish | ||
| resume from replay, we read the worker log: the genesis-bootstrap line | ||
| must not appear after the restart. | ||
| """ | ||
|
|
||
| def __init__(self, ctx: flexitest.InitContext): | ||
| ctx.set_env("basic") | ||
|
|
||
| def main(self, ctx: flexitest.RunContext): | ||
| bitcoind_service = ctx.get_service("bitcoin") | ||
| asm_service = ctx.get_service("asm_rpc") | ||
| log_path = asm_service.props["log_path"] | ||
|
|
||
| bitcoin_rpc = bitcoind_service.create_rpc() | ||
| asm_rpc = asm_service.create_rpc() | ||
|
|
||
| wait_until_bitcoind_ready(bitcoin_rpc, timeout=30) | ||
| wait_until_asm_ready(asm_rpc) | ||
|
|
||
| # Drive ASM to a known height before restarting. | ||
| initial_btc_height = bitcoin_rpc.proxy.getblockcount() | ||
| wallet_addr = bitcoin_rpc.proxy.getnewaddress() | ||
| pre_blocks = 3 | ||
| bitcoin_rpc.proxy.generatetoaddress(pre_blocks, wallet_addr) | ||
| pre_restart_height = initial_btc_height + pre_blocks | ||
| wait_until_asm_reaches_height(asm_rpc, min_height=pre_restart_height) | ||
|
|
||
| # Snapshot a processed block we expect to survive the restart. | ||
| snapshot_height = initial_btc_height + 1 | ||
| snapshot_hash = bitcoin_rpc.proxy.getblockhash(snapshot_height) | ||
| pre_state = asm_rpc.strata_asm_getAsmState(snapshot_hash) | ||
| assert pre_state is not None, ( | ||
| f"strata_asm_getAsmState returned None at height {snapshot_height} pre-restart" | ||
| ) | ||
|
|
||
| # Mark where the post-restart slice of the log file begins. The runner | ||
| # appends to this file across stop/start, so a byte offset captured now | ||
| # cleanly partitions pre- vs post-restart output. | ||
| log_offset = os.path.getsize(log_path) | ||
|
|
||
| logging.info("stopping ASM runner at height %s", pre_restart_height) | ||
| asm_service.stop() | ||
|
|
||
| # Mine while the runner is down so it has to catch up on restart — | ||
| # exercises the watcher's gap-fill path, not just steady state. | ||
| catchup_blocks = 2 | ||
| bitcoin_rpc.proxy.generatetoaddress(catchup_blocks, wallet_addr) | ||
| post_restart_target = pre_restart_height + catchup_blocks | ||
|
|
||
| logging.info("restarting ASM runner") | ||
| asm_service.start() | ||
| asm_rpc = asm_service.create_rpc() | ||
| wait_until_asm_ready(asm_rpc) | ||
| wait_until_asm_reaches_height(asm_rpc, min_height=post_restart_target) | ||
| logging.info("ASM caught up past restart to height %s", post_restart_target) | ||
|
|
||
| # Resume vs replay: the genesis-bootstrap line only fires when the | ||
| # worker can't find an existing genesis manifest. If the post-restart | ||
| # log slice contains it, the runner threw away persisted state and | ||
| # rebuilt from scratch — exactly the failure mode the test is for. | ||
| with open(log_path, "rb") as f: | ||
| f.seek(log_offset) | ||
| post_log = f.read().decode("utf-8", errors="replace") | ||
| assert GENESIS_BOOTSTRAP_MARKER not in post_log, ( | ||
| f"runner re-emitted {GENESIS_BOOTSTRAP_MARKER!r} after restart — " | ||
| "it restarted from genesis instead of resuming from persisted state" | ||
| ) | ||
|
Comment on lines
+56
to
+95
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. You might also want to check out the log matcher utility in This is of course very brittle. |
||
|
|
||
| # Sanity: state for a pre-restart block is still queryable and | ||
| # identical post-restart. Weaker than the log check on its own (a | ||
| # fresh replay would produce the same payload on the same chain), but | ||
| # catches durability regressions where the data is gone entirely. | ||
| post_state = asm_rpc.strata_asm_getAsmState(snapshot_hash) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Does it also make sense to query for an intermediate height (that is one of the blocks that got backfilled)? I'm thinking not because the ASM cannot progress without block continuity.
Collaborator
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, ASM cannot progress without block continuity. |
||
| assert post_state == pre_state, ( | ||
| f"AsmState at height {snapshot_height} changed across restart: " | ||
| f"pre={pre_state!r} post={post_state!r}" | ||
| ) | ||
|
|
||
| return True | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't
>sufficient? Was the duplication at the boundary (tip_height == cursor) deliberate?