fix(asm-runner): catch up to bitcoind tip on restart by prajwolrg · Pull Request #105 · alpenlabs/asm

prajwolrg · 2026-05-17T10:36:06Z

Description

The runner's restart path was broken by an interaction between ZMQ block delivery and persisted state: ZMQ only forwards blocks mined after we subscribe, so when the runner came up with a persisted height below the chain tip, it sat idle waiting for an event that wouldn't fire until someone mined another block. Poll the tip via RPC and backfill the gap before entering the wait loop.
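The catch-up step described above can be sketched as follows. This is a minimal Python illustration rather than the runner's Rust code: `startup_catchup` and its callback parameters are hypothetical names, and `cursor` is assumed to be the next height the worker has not yet processed.

```python
from typing import Callable, List


def startup_catchup(
    cursor: int,
    get_tip_height: Callable[[], int],
    fetch_block: Callable[[int], str],
    submit_block: Callable[[str], None],
) -> int:
    """Backfill every height in [cursor, tip], then return the next height to wait for.

    Assumption: `cursor` is the next unprocessed height, not the last processed one.
    """
    tip = get_tip_height()
    for height in range(cursor, tip + 1):
        # Any exception propagates immediately, so the worker is never handed a gap.
        submit_block(fetch_block(height))
    # If the tip is below the cursor, nothing was backfilled and the cursor is unchanged.
    return max(cursor, tip + 1)


# Usage: the runner persisted up to height 100 and the chain advanced to 103 while it was down.
submitted: List[str] = []
next_height = startup_catchup(
    cursor=101,
    get_tip_height=lambda: 103,
    fetch_block=lambda h: f"block-{h}",
    submit_block=submitted.append,
)
```

Only after this returns would the watcher start waiting on ZMQ notifications, beginning at `next_height`.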

The branch also retires two persistence_across_reopen unit tests in crates/storage that were flaky on Linux (sled 0.34 tempdir-lock race). They duplicated coverage that sled itself provides, and were proxying for end-to-end behavior — runner state surviving stop/start — that we never actually tested. Added a flexitest case (fn_asm_restart_test.py) that does exercise it: drive the runner to height H, stop, mine more blocks, restart, assert it both resumes from persisted state and catches up.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New or updated tests

Notes to Reviewers

The resume-vs-replay assertion in the new test fn_asm_restart_test.py currently grep-matches a log line ("Created genesis manifest") to detect that the runner did not re-bootstrap from genesis. This is fragile — it couples the test to wording in crates/worker/src/service.rs:172 and to the fact that the harness log file is append-mode across restarts.

An alternative considered was exposing a discriminator directly through the existing status RPC.

pub enum StartupKind {
    Bootstrapped,
    Resumed { from_height: u64 },
}

pub struct AsmWorkerStatus {
    // ...existing fields...
    pub startup_kind: StartupKind,
}

However, this would expand the public RPC surface for what is, today, a primarily test-driven need, and an API contract decision deserves its own discussion rather than slipping in alongside a hang fix. The current test catches the regression we care about and will fail loudly (not silently) if the log line moves. Happy to do the RPC change as a follow-up if reviewers agree on the shape.
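For discussion only, here is a sketch of how the functional test might consume such a field if it were ever added. Everything here is hypothetical: the `startup_kind` key does not exist in the status RPC today, and the JSON shape assumes serde's default externally tagged enum encoding (`"Bootstrapped"` or `{"Resumed": {"from_height": N}}`).

```python
from typing import Optional


def resumed_from(status: dict) -> Optional[int]:
    """Return the resume height, or None if the runner bootstrapped from genesis.

    Assumption: `status` is the decoded JSON of a status response that carries
    a hypothetical `startup_kind` field.
    """
    kind = status["startup_kind"]
    if kind == "Bootstrapped":
        return None
    if isinstance(kind, dict) and "Resumed" in kind:
        return kind["Resumed"]["from_height"]
    raise ValueError(f"unexpected startup_kind: {kind!r}")
```

With this in place, the test could assert `resumed_from(status) is not None` after the restart instead of scanning the log file.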

The run_test.sh change is collateral: the set -u script trips over "${CARGO_ARGS[@]}" when the array is empty (native backend). Fixed in passing so the new test can run.

Checklist

I have performed a self-review of my code.
I have commented my code where necessary.
I have updated the documentation if needed.
My changes do not introduce new warnings.
I have added tests that prove my changes are effective or that my feature works.
New and existing tests pass with my changes.

Related Issues

github-actions · 2026-05-17T10:45:49Z

Commit: 2e7d93f
SP1 Execution Results

program	cycles	gas
asm-stf	130,103,924	129,731,987
moho	5,191,380	5,499,715

codecov · 2026-05-17T10:48:55Z

Codecov Report

❌ Patch coverage is 74.19355% with 8 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
bin/asm-runner/src/block_watcher.rs	74.19%	8 Missing ⚠️

Files with missing lines	Coverage Δ
crates/storage/src/export_entries.rs	`98.52% <ø> (-0.14%)`	⬇️
crates/storage/src/mmr.rs	`100.00% <ø> (ø)`
bin/asm-runner/src/block_watcher.rs	`83.47% <74.19%> (+0.32%)`	⬆️

... and 10 files with indirect coverage changes

Rajil1213

The backfilling API can be improved. Just using the btc_fetcher crate directly is also an option.

Rajil1213 · 2026-05-18T11:08:30Z

+        .get_block_count()
+        .await
+        .context("failed to query bitcoind tip for startup catchup")?;
+    if tip_height >= cursor {


Isn't > sufficient? Was the duplication at the boundary (tip_height == cursor) deliberate?
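The boundary case can be checked with a little arithmetic. The sketch below assumes `cursor` is the next unprocessed height (which the `cursor = tip_height + 1` update in the diff suggests, though the PR does not state it); `heights_to_backfill` is a hypothetical helper that mirrors the inclusive range `cursor..=tip_height`.

```python
from typing import List


def heights_to_backfill(cursor: int, tip_height: int) -> List[int]:
    """Heights covered by the inclusive range cursor..=tip_height."""
    return list(range(cursor, tip_height + 1))


# tip_height == cursor: exactly one block is outstanding.
at_boundary = heights_to_backfill(100, 100)
# tip_height < cursor: nothing to do.
already_caught_up = heights_to_backfill(100, 99)
```

Under that assumption, a strict `>` guard would skip the single outstanding block when `tip_height == cursor`. If `cursor` instead meant the last processed height, the same case would resubmit a block, which is the duplication being asked about.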

Rajil1213 · 2026-05-18T11:19:30Z

+            backfill_range(
+                &bitcoin_client,
+                &asm_worker,
+                &proof_tx,
+                cursor..=received_height - 1,
+            )


This duplication hints at a deeper issue. Right now, the block watcher blurs the boundary between the bitcoin block fetcher and the block submitter. A cleaner approach is to separate the two. If you look at the btc_tracker crate in strata-bridge, you'll find a loop that just ingests blocks from bitcoind, and a separate piece that pushes those blocks to subscribers. The backfilling logic belongs in the former (which is responsible for receiving blocks from bitcoind); the latter just takes those blocks and pushes them out.

This also means that you don't need a brittle functional test. You can just add a proptest that asserts that block production is contiguous.

Downstream consumers (in this case the asm worker) can then rely on the guarantee that, given a start height, they will always receive blocks in height order without any gaps.
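A rough sketch of that contiguity guarantee and a randomized check of it, in Python rather than Rust and without a property-testing library. `contiguous_heights` is a hypothetical stand-in for the fetcher side and is not taken from btc_tracker; notifications are modelled as bare heights that may be skipped, repeated, or stale.

```python
import random
from typing import Iterable, Iterator


def contiguous_heights(start: int, notified: Iterable[int]) -> Iterator[int]:
    """Yield every height from `start` upward exactly once, in order.

    `notified` models tip notifications that may skip, repeat, or lag behind.
    """
    next_height = start
    for tip in notified:
        # Backfill everything between the cursor and the notified height.
        while next_height <= tip:
            yield next_height
            next_height += 1


def check_contiguity(trials: int = 200, seed: int = 0) -> None:
    """Randomized check: output is always a gap-free run starting at `start`."""
    rng = random.Random(seed)
    for _ in range(trials):
        start = rng.randint(0, 50)
        notified = [rng.randint(0, 100) for _ in range(rng.randint(0, 20))]
        out = list(contiguous_heights(start, notified))
        highest = max(notified, default=start - 1)
        assert out == list(range(start, max(highest, start - 1) + 1))


example = list(contiguous_heights(5, [7, 6, 9]))
```

The consumer never sees the raw notifications, only the gap-free stream, so it needs no backfill logic of its own.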

Rajil1213 · 2026-05-18T11:20:41Z

+    for height in range {
+        let block = fetch_block_at_height(client, height)
+            .await
+            .with_context(|| format!("backfill fetch failed at height {height}"))?;
+        submit_block(asm_worker, proof_tx, block)
+            .await
+            .with_context(|| format!("backfill submit failed at height {height}"))?;
+    }


Since this just depends on the block height, will there be issues if there is a fork around the current tip?
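One common way to guard a height-based backfill against a fork is to check that each fetched block links to the previously accepted one. The sketch below illustrates that idea only; it is not what this PR does, and `Header`, `ReorgDetected`, and `backfill_checked` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass(frozen=True)
class Header:
    height: int
    hash: str
    prev_hash: str


class ReorgDetected(Exception):
    """Raised when a fetched block does not extend the last accepted block."""


def backfill_checked(
    last_hash: Optional[str],
    heights: Iterable[int],
    fetch_header: Callable[[int], Header],
) -> List[Header]:
    """Fetch headers by height, verifying prev-hash linkage along the way."""
    accepted: List[Header] = []
    for height in heights:
        header = fetch_header(height)
        # A mismatch means the chain at this height is not the one we built on.
        if last_hash is not None and header.prev_hash != last_hash:
            raise ReorgDetected(f"height {height} does not extend {last_hash}")
        accepted.append(header)
        last_hash = header.hash
    return accepted


chain = {
    101: Header(101, "h101", "h100"),
    102: Header(102, "h102", "h101"),
}
accepted = backfill_checked("h100", range(101, 103), chain.__getitem__)
```

How the worker should react to `ReorgDetected` (roll back, refetch, or halt) is a separate design question that this sketch does not address.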

Rajil1213 · 2026-05-18T11:24:56Z

+        # Snapshot a processed block we expect to survive the restart.
+        snapshot_height = initial_btc_height + 1
+        snapshot_hash = bitcoin_rpc.proxy.getblockhash(snapshot_height)
+        pre_state = asm_rpc.strata_asm_getAsmState(snapshot_hash)
+        assert pre_state is not None, (
+            f"strata_asm_getAsmState returned None at height {snapshot_height} pre-restart"
+        )
+
+        # Mark where the post-restart slice of the log file begins. The runner
+        # appends to this file across stop/start, so a byte offset captured now
+        # cleanly partitions pre- vs post-restart output.
+        log_offset = os.path.getsize(log_path)
+
+        logging.info("stopping ASM runner at height %s", pre_restart_height)
+        asm_service.stop()
+
+        # Mine while the runner is down so it has to catch up on restart —
+        # exercises the watcher's gap-fill path, not just steady state.
+        catchup_blocks = 2
+        bitcoin_rpc.proxy.generatetoaddress(catchup_blocks, wallet_addr)
+        post_restart_target = pre_restart_height + catchup_blocks
+
+        logging.info("restarting ASM runner")
+        asm_service.start()
+        asm_rpc = asm_service.create_rpc()
+        wait_until_asm_ready(asm_rpc)
+        wait_until_asm_reaches_height(asm_rpc, min_height=post_restart_target)
+        logging.info("ASM caught up past restart to height %s", post_restart_target)
+
+        # Resume vs replay: the genesis-bootstrap line only fires when the
+        # worker can't find an existing genesis manifest. If the post-restart
+        # log slice contains it, the runner threw away persisted state and
+        # rebuilt from scratch — exactly the failure mode the test is for.
+        with open(log_path, "rb") as f:
+            f.seek(log_offset)
+            post_log = f.read().decode("utf-8", errors="replace")
+        assert GENESIS_BOOTSTRAP_MARKER not in post_log, (
+            f"runner re-emitted {GENESIS_BOOTSTRAP_MARKER!r} after restart — "
+            "it restarted from genesis instead of resuming from persisted state"
+        )


You might also want to check out the log matcher utility in strata-bridge which wraps the above logic in wait_until (not necessary here because you're only checking the logs after the ASM catches up).

This is of course very brittle.
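For reference, the offset-based log check wrapped in a polling helper might look like the sketch below. These functions are hypothetical and do not reproduce the strata-bridge utility's actual API; they only show the shape of reading the post-restart slice and retrying until a marker appears.

```python
import time
from typing import Callable


def read_log_since(path: str, offset: int) -> str:
    """Return the part of the log file written after byte `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read().decode("utf-8", errors="replace")


def wait_until(predicate: Callable[[], bool], timeout: float = 5.0, interval: float = 0.1) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def wait_for_log(path: str, offset: int, needle: str, timeout: float = 5.0) -> bool:
    """Wait for `needle` to appear in the log slice that starts at `offset`."""
    return wait_until(lambda: needle in read_log_since(path, offset), timeout)
```

A negative assertion such as "the genesis marker never appears" still has to be checked after the runner has caught up, as the test does, since polling can only confirm presence, not absence.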

Rajil1213 · 2026-05-18T11:25:43Z

+        # identical post-restart. Weaker than the log check on its own (a
+        # fresh replay would produce the same payload on the same chain), but
+        # catches durability regressions where the data is gone entirely.
+        post_state = asm_rpc.strata_asm_getAsmState(snapshot_hash)


Does it also make sense to query for an intermediate height (that is one of the blocks that got backfilled)? I'm thinking not because the ASM cannot progress without block continuity.

Yes, ASM cannot progress without block continuity.

Rajil1213 · 2026-05-18T11:28:42Z

@codex review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 719abbbc71

chatgpt-codex-connector · 2026-05-18T11:33:29Z

+    if tip_height >= cursor {
+        backfill_range(&bitcoin_client, &asm_worker, &proof_tx, cursor..=tip_height).await?;
+        cursor = tip_height + 1;


Honor shutdown during startup catch-up

When tip_height >= cursor, the watcher runs backfill_range(...) before entering the tokio::select! loop that listens for shutdown.wait_for_shutdown(). If the node is far behind (large restart gap), this RPC/backfill loop can run for a long time and the process will ignore shutdown requests until it finishes, which can make service stop/restart operations hang under backlog conditions. Please make startup catch-up cancellable (e.g., check shutdown between heights or interleave catch-up with the shutdown wait).
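The "check shutdown between heights" option could look roughly like this. It is a Python sketch of the control flow only, using `threading.Event` as a stand-in for the runner's shutdown signal; `backfill_cancellable` and `process_height` are hypothetical names.

```python
import threading
from typing import Callable, Iterable, List, Optional


def backfill_cancellable(
    heights: Iterable[int],
    process_height: Callable[[int], None],
    shutdown: threading.Event,
) -> Optional[int]:
    """Process heights in order, stopping early once shutdown is requested.

    Returns the last height that was fully processed, or None if none were.
    """
    last_done: Optional[int] = None
    for height in heights:
        # Checked before each height, so a stop request waits for at most one block.
        if shutdown.is_set():
            break
        process_height(height)
        last_done = height
    return last_done


# Usage: a shutdown request arrives while height 102 is being processed.
shutdown = threading.Event()
processed: List[int] = []


def process(height: int) -> None:
    processed.append(height)
    if height == 102:
        shutdown.set()


last = backfill_cancellable(range(100, 110), process, shutdown)
```

Because the loop stops on a block boundary, the persisted cursor stays consistent and the next start can resume from `last + 1`.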

The ZMQ subscription only delivers blocks mined after we subscribe, so any heights between the worker's last persisted block and the current chain tip would otherwise wait indefinitely for a fresh event. Restarting after the chain had moved on left the runner hung with no recovery short of another block being mined live. Poll the tip via RPC and backfill the gap before entering the ZMQ wait loop. The in-loop backfill (rare ZMQ drops) and the new startup catchup now share a single `backfill_range` helper that bails on first failure to avoid handing the worker a gap.

Restart semantics live at the binary boundary — the worker reloads from sled, resumes from the last persisted block, and reconnects to bitcoind. The storage-layer unit tests only re-check sled's own durability and never exercise this path. Drive the runner past a few blocks, stop it, mine more while it's down, and assert it resumes from persisted state and catches up. The resume-vs-replay discriminator currently grep-matches the genesis-bootstrap log line; see the test's docstring for the rationale and a pointer to a follow-up that should surface this through the status RPC instead. Also expose the runner log path via the asm_rpc factory (needed by the test) and guard CARGO_ARGS expansion in run_test.sh so the native-backend path doesn't trip `set -u` when no extra cargo flags are set.

These exercised sled's own file-lock and durability across reopen, which sled already covers. They were also flaky on Linux due to a sled 0.34 race on tempdir cleanup — papered over in the test body with explicit `drop` and `flush` calls. End-to-end persistence is now covered at the level we actually care about (runner reload across stop/start) by the new asm-runner restart test, so these unit tests carry their flakiness for no incremental coverage.

prajwolrg · 2026-05-26T00:42:55Z

> This duplication hints at a deeper issue.

You're right. We should handle this more appropriately. Moving this to draft for now.

prajwolrg self-assigned this May 17, 2026

prajwolrg requested review from Rajil1213 and evgenyzdanovich May 17, 2026 16:19

Rajil1213 requested changes May 18, 2026

chatgpt-codex-connector Bot reviewed May 18, 2026

prajwolrg added 3 commits May 18, 2026 21:40

prajwolrg force-pushed the fix-flaky-tempdir-lock branch from 719abbb to 8ad081e Compare May 18, 2026 15:55

prajwolrg mentioned this pull request May 19, 2026

test(storage): drop redundant persistence-across-reopen tests #107

Merged

4 tasks

prajwolrg marked this pull request as draft May 26, 2026 00:42
