fix(asm-runner): catch up to bitcoind tip on restart#105
Conversation
|
Commit: 2e7d93f
|
Codecov Report❌ Patch coverage is
... and 10 files with indirect coverage changes 🚀 New features to boost your workflow:
|
Rajil1213
left a comment
There was a problem hiding this comment.
The backfilling API can be improved. Just using the btc_fetcher crate directly is also an option.
| .get_block_count() | ||
| .await | ||
| .context("failed to query bitcoind tip for startup catchup")?; | ||
| if tip_height >= cursor { |
There was a problem hiding this comment.
Isn't > sufficient? Was the duplication at the boundary (tip_height == cursor) deliberate?
| backfill_range( | ||
| &bitcoin_client, | ||
| &asm_worker, | ||
| &proof_tx, | ||
| cursor..=received_height - 1, | ||
| ) |
There was a problem hiding this comment.
This duplication hints at a deeper issue. Right now, the block watcher muddies the boundary between the bitcoin block fetcher and block submitter. A cleaner way to do this is to separate the two. If you look at the btc_tracker crate in strata-bridge, you'll find a loop that just ingests blocks from bitcoind, then there is a separate block that pushes blocks to subscribers. The backfilling logic belongs on the former (that is responsible for receiving blocks from bitcoind), the latter just takes those blocks and pushes them out.
This also means that you don't need a brittle functional test. You can just add a proptest that asserts that block production is contiguous.
Downstream consumers (in this case the asm worker) can just depend on the fact that provided a start height, it will always receive blocks in the order of block heights without any discontinuity.
| for height in range { | ||
| let block = fetch_block_at_height(client, height) | ||
| .await | ||
| .with_context(|| format!("backfill fetch failed at height {height}"))?; | ||
| submit_block(asm_worker, proof_tx, block) | ||
| .await | ||
| .with_context(|| format!("backfill submit failed at height {height}"))?; | ||
| } |
There was a problem hiding this comment.
Since this just depends on the block height, will there be issues if there is a fork around the current tip?
| # Snapshot a processed block we expect to survive the restart. | ||
| snapshot_height = initial_btc_height + 1 | ||
| snapshot_hash = bitcoin_rpc.proxy.getblockhash(snapshot_height) | ||
| pre_state = asm_rpc.strata_asm_getAsmState(snapshot_hash) | ||
| assert pre_state is not None, ( | ||
| f"strata_asm_getAsmState returned None at height {snapshot_height} pre-restart" | ||
| ) | ||
|
|
||
| # Mark where the post-restart slice of the log file begins. The runner | ||
| # appends to this file across stop/start, so a byte offset captured now | ||
| # cleanly partitions pre- vs post-restart output. | ||
| log_offset = os.path.getsize(log_path) | ||
|
|
||
| logging.info("stopping ASM runner at height %s", pre_restart_height) | ||
| asm_service.stop() | ||
|
|
||
| # Mine while the runner is down so it has to catch up on restart — | ||
| # exercises the watcher's gap-fill path, not just steady state. | ||
| catchup_blocks = 2 | ||
| bitcoin_rpc.proxy.generatetoaddress(catchup_blocks, wallet_addr) | ||
| post_restart_target = pre_restart_height + catchup_blocks | ||
|
|
||
| logging.info("restarting ASM runner") | ||
| asm_service.start() | ||
| asm_rpc = asm_service.create_rpc() | ||
| wait_until_asm_ready(asm_rpc) | ||
| wait_until_asm_reaches_height(asm_rpc, min_height=post_restart_target) | ||
| logging.info("ASM caught up past restart to height %s", post_restart_target) | ||
|
|
||
| # Resume vs replay: the genesis-bootstrap line only fires when the | ||
| # worker can't find an existing genesis manifest. If the post-restart | ||
| # log slice contains it, the runner threw away persisted state and | ||
| # rebuilt from scratch — exactly the failure mode the test is for. | ||
| with open(log_path, "rb") as f: | ||
| f.seek(log_offset) | ||
| post_log = f.read().decode("utf-8", errors="replace") | ||
| assert GENESIS_BOOTSTRAP_MARKER not in post_log, ( | ||
| f"runner re-emitted {GENESIS_BOOTSTRAP_MARKER!r} after restart — " | ||
| "it restarted from genesis instead of resuming from persisted state" | ||
| ) |
There was a problem hiding this comment.
You might also want to check out the log matcher utility in strata-bridge which wraps the above logic in wait_until (not necessary here because you're only checking the logs after the ASM catches up).
This is of course very brittle.
| # identical post-restart. Weaker than the log check on its own (a | ||
| # fresh replay would produce the same payload on the same chain), but | ||
| # catches durability regressions where the data is gone entirely. | ||
| post_state = asm_rpc.strata_asm_getAsmState(snapshot_hash) |
There was a problem hiding this comment.
Does it also make sense to query for an intermediate height (that is one of the blocks that got backfilled)? I'm thinking not because the ASM cannot progress without block continuity.
There was a problem hiding this comment.
Yes, ASM cannot progress without block continuity.
|
@codex review. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 719abbbc71
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if tip_height >= cursor { | ||
| backfill_range(&bitcoin_client, &asm_worker, &proof_tx, cursor..=tip_height).await?; | ||
| cursor = tip_height + 1; |
There was a problem hiding this comment.
Honor shutdown during startup catch-up
When tip_height >= cursor, the watcher runs backfill_range(...) before entering the tokio::select! loop that listens for shutdown.wait_for_shutdown(). If the node is far behind (large restart gap), this RPC/backfill loop can run for a long time and the process will ignore shutdown requests until it finishes, which can make service stop/restart operations hang under backlog conditions. Please make startup catch-up cancellable (e.g., check shutdown between heights or interleave catch-up with the shutdown wait).
Useful? React with 👍 / 👎.
The ZMQ subscription only delivers blocks mined after we subscribe, so any heights between the worker's last persisted block and the current chain tip would otherwise wait indefinitely for a fresh event. Restarting after the chain had moved on left the runner hung with no recovery short of another block being mined live. Poll the tip via RPC and backfill the gap before entering the ZMQ wait loop. The in-loop backfill (rare ZMQ drops) and the new startup catchup now share a single `backfill_range` helper that bails on first failure to avoid handing the worker a gap.
Restart semantics live at the binary boundary — the worker reloads from sled, resumes from the last persisted block, and reconnects to bitcoind. The storage-layer unit tests only re-check sled's own durability and never exercise this path. Drive the runner past a few blocks, stop it, mine more while it's down, and assert it resumes from persisted state and catches up. The resume-vs-replay discriminator currently grep-matches the genesis-bootstrap log line; see the test's docstring for the rationale and a pointer to a follow-up that should surface this through the status RPC instead. Also expose the runner log path via the asm_rpc factory (needed by the test) and guard CARGO_ARGS expansion in run_test.sh so the native-backend path doesn't trip `set -u` when no extra cargo flags are set.
These exercised sled's own file-lock and durability across reopen, which sled already covers. They were also flaky on Linux due to a sled 0.34 race on tempdir cleanup — papered over in the test body with explicit `drop` and `flush` calls. End-to-end persistence is now covered at the level we actually care about (runner reload across stop/start) by the new asm-runner restart test, so these unit tests carry their flakiness for no incremental coverage.
719abbb to
8ad081e
Compare
|
Moving this to draft for now.
You're right. We should handle this more appropriately. Moving this to draft for now. |
Description
The runner's restart path was broken by an interaction between ZMQ block delivery and persisted state: ZMQ only forwards blocks mined after we subscribe, so when the runner came up with a persisted height below the chain tip, it sat idle waiting for an event that wouldn't fire until someone mined another block. Poll the tip via RPC and backfill the gap before entering the wait loop.
The branch also retires two
persistence_across_reopenunit tests incrates/storagethat were flaky on Linux (sled 0.34 tempdir-lock race). They duplicated coverage that sled itself provides, and were proxying for end-to-end behavior — runner state surviving stop/start — that we never actually tested. Added a flexitest case (fn_asm_restart_test.py) that does exercise it: drive the runner to height H, stop, mine more blocks, restart, assert it both resumes from persisted state and catches up.Type of Change
Notes to Reviewers
The resume-vs-replay assertion in the new test
fn_asm_restart_test.pycurrently grep-matches a log line ("Created genesis manifest") to detect that the runner did not re-bootstrap from genesis. This is fragile — it couples the test to wording incrates/worker/src/service.rs:172and to the fact that the harness log file is append-mode across restarts.An alternative considered was exposing a discriminator directly through the existing
statusRPC.However, it expands the public RPC surface for what is, today, a primarily test-driven need, and an API contract decision is worth its own discussion rather than slipping in alongside a hang fix. The current test catches the regression we care about and will fail loudly (not silently) if the log line moves. Happy to do the RPC change as a follow-up if reviewers agree on the shape.
The
run_test.shchange is collateral: theset -uscript trips over"${CARGO_ARGS[@]}"when the array is empty (native backend). Fixed in passing so the new test can run.Checklist
Related Issues