Skip to content

fix(asm-runner): catch up to bitcoind tip on restart#105

Draft
prajwolrg wants to merge 3 commits into
mainfrom
fix-flaky-tempdir-lock
Draft

fix(asm-runner): catch up to bitcoind tip on restart#105
prajwolrg wants to merge 3 commits into
mainfrom
fix-flaky-tempdir-lock

Conversation

@prajwolrg
Copy link
Copy Markdown
Collaborator

@prajwolrg prajwolrg commented May 17, 2026

Description

The runner's restart path was broken by an interaction between ZMQ block delivery and persisted state: ZMQ only forwards blocks mined after we subscribe, so when the runner came up with a persisted height below the chain tip, it sat idle waiting for an event that wouldn't fire until someone mined another block. Poll the tip via RPC and backfill the gap before entering the wait loop.

The branch also retires two persistence_across_reopen unit tests in crates/storage that were flaky on Linux (sled 0.34 tempdir-lock race). They duplicated coverage that sled itself provides, and were proxying for end-to-end behavior — runner state surviving stop/start — that we never actually tested. Added a flexitest case (fn_asm_restart_test.py) that does exercise it: drive the runner to height H, stop, mine more blocks, restart, assert it both resumes from persisted state and catches up.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New or updated tests

Notes to Reviewers

The resume-vs-replay assertion in the new test fn_asm_restart_test.py currently grep-matches a log line ("Created genesis manifest") to detect that the runner did not re-bootstrap from genesis. This is fragile — it couples the test to wording in crates/worker/src/service.rs:172 and to the fact that the harness log file is append-mode across restarts.

An alternative considered was exposing a discriminator directly through the existing status RPC.

pub enum StartupKind {
    Bootstrapped,
    Resumed { from_height: u64 },
}

pub struct AsmWorkerStatus {
    // ...existing fields...
    pub startup_kind: StartupKind,
}

However, it expands the public RPC surface for what is, today, a primarily test-driven need, and an API contract decision is worth its own discussion rather than slipping in alongside a hang fix. The current test catches the regression we care about and will fail loudly (not silently) if the log line moves. Happy to do the RPC change as a follow-up if reviewers agree on the shape.

The run_test.sh change is collateral: the set -u script trips over "${CARGO_ARGS[@]}" when the array is empty (native backend). Fixed in passing so the new test can run.

Checklist

  • I have performed a self-review of my code.
  • I have commented my code where necessary.
  • I have updated the documentation if needed.
  • My changes do not introduce new warnings.
  • I have added tests that prove my changes are effective or that my feature works.
  • New and existing tests pass with my changes.

Related Issues

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 17, 2026

Commit: 2e7d93f
SP1 Execution Results

program cycles gas
asm-stf 130,103,924 129,731,987
moho 5,191,380 5,499,715

@codecov
Copy link
Copy Markdown

codecov Bot commented May 17, 2026

Codecov Report

❌ Patch coverage is 74.19355% with 8 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
bin/asm-runner/src/block_watcher.rs 74.19% 8 Missing ⚠️
Files with missing lines Coverage Δ
crates/storage/src/export_entries.rs 98.52% <ø> (-0.14%) ⬇️
crates/storage/src/mmr.rs 100.00% <ø> (ø)
bin/asm-runner/src/block_watcher.rs 83.47% <74.19%> (+0.32%) ⬆️

... and 10 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@prajwolrg prajwolrg self-assigned this May 17, 2026
Copy link
Copy Markdown

@Rajil1213 Rajil1213 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The backfilling API can be improved. Just using the btc_fetcher crate directly is also an option.

.get_block_count()
.await
.context("failed to query bitcoind tip for startup catchup")?;
if tip_height >= cursor {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't > sufficient? Was the duplication at the boundary (tip_height == cursor) deliberate?

Comment on lines +114 to +119
backfill_range(
&bitcoin_client,
&asm_worker,
&proof_tx,
cursor..=received_height - 1,
)
Copy link
Copy Markdown

@Rajil1213 Rajil1213 May 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This duplication hints at a deeper issue. Right now, the block watcher muddies the boundary between the bitcoin block fetcher and block submitter. A cleaner way to do this is to separate the two. If you look at the btc_tracker crate in strata-bridge, you'll find a loop that just ingests blocks from bitcoind, then there is a separate block that pushes blocks to subscribers. The backfilling logic belongs on the former (that is responsible for receiving blocks from bitcoind), the latter just takes those blocks and pushes them out.

This also means that you don't need a brittle functional test. You can just add a proptest that asserts that block production is contiguous.

Downstream consumers (in this case the asm worker) can just depend on the fact that provided a start height, it will always receive blocks in the order of block heights without any discontinuity.

Comment on lines +139 to +146
for height in range {
let block = fetch_block_at_height(client, height)
.await
.with_context(|| format!("backfill fetch failed at height {height}"))?;
submit_block(asm_worker, proof_tx, block)
.await
.with_context(|| format!("backfill submit failed at height {height}"))?;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this just depends on the block height, will there be issues if there is a fork around the current tip?

Comment on lines +56 to +95
# Snapshot a processed block we expect to survive the restart.
snapshot_height = initial_btc_height + 1
snapshot_hash = bitcoin_rpc.proxy.getblockhash(snapshot_height)
pre_state = asm_rpc.strata_asm_getAsmState(snapshot_hash)
assert pre_state is not None, (
f"strata_asm_getAsmState returned None at height {snapshot_height} pre-restart"
)

# Mark where the post-restart slice of the log file begins. The runner
# appends to this file across stop/start, so a byte offset captured now
# cleanly partitions pre- vs post-restart output.
log_offset = os.path.getsize(log_path)

logging.info("stopping ASM runner at height %s", pre_restart_height)
asm_service.stop()

# Mine while the runner is down so it has to catch up on restart —
# exercises the watcher's gap-fill path, not just steady state.
catchup_blocks = 2
bitcoin_rpc.proxy.generatetoaddress(catchup_blocks, wallet_addr)
post_restart_target = pre_restart_height + catchup_blocks

logging.info("restarting ASM runner")
asm_service.start()
asm_rpc = asm_service.create_rpc()
wait_until_asm_ready(asm_rpc)
wait_until_asm_reaches_height(asm_rpc, min_height=post_restart_target)
logging.info("ASM caught up past restart to height %s", post_restart_target)

# Resume vs replay: the genesis-bootstrap line only fires when the
# worker can't find an existing genesis manifest. If the post-restart
# log slice contains it, the runner threw away persisted state and
# rebuilt from scratch — exactly the failure mode the test is for.
with open(log_path, "rb") as f:
f.seek(log_offset)
post_log = f.read().decode("utf-8", errors="replace")
assert GENESIS_BOOTSTRAP_MARKER not in post_log, (
f"runner re-emitted {GENESIS_BOOTSTRAP_MARKER!r} after restart — "
"it restarted from genesis instead of resuming from persisted state"
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might also want to check out the log matcher utility in strata-bridge which wraps the above logic in wait_until (not necessary here because you're only checking the logs after the ASM catches up).

This is of course very brittle.

# identical post-restart. Weaker than the log check on its own (a
# fresh replay would produce the same payload on the same chain), but
# catches durability regressions where the data is gone entirely.
post_state = asm_rpc.strata_asm_getAsmState(snapshot_hash)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it also make sense to query for an intermediate height (that is one of the blocks that got backfilled)? I'm thinking not because the ASM cannot progress without block continuity.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, ASM cannot progress without block continuity.

@Rajil1213
Copy link
Copy Markdown

@codex review.

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 719abbbc71

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +65 to +67
if tip_height >= cursor {
backfill_range(&bitcoin_client, &asm_worker, &proof_tx, cursor..=tip_height).await?;
cursor = tip_height + 1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor shutdown during startup catch-up

When tip_height >= cursor, the watcher runs backfill_range(...) before entering the tokio::select! loop that listens for shutdown.wait_for_shutdown(). If the node is far behind (large restart gap), this RPC/backfill loop can run for a long time and the process will ignore shutdown requests until it finishes, which can make service stop/restart operations hang under backlog conditions. Please make startup catch-up cancellable (e.g., check shutdown between heights or interleave catch-up with the shutdown wait).

Useful? React with 👍 / 👎.

prajwolrg added 3 commits May 18, 2026 21:40
The ZMQ subscription only delivers blocks mined after we subscribe, so any
heights between the worker's last persisted block and the current chain
tip would otherwise wait indefinitely for a fresh event. Restarting after
the chain had moved on left the runner hung with no recovery short of
another block being mined live.

Poll the tip via RPC and backfill the gap before entering the ZMQ wait
loop. The in-loop backfill (rare ZMQ drops) and the new startup catchup
now share a single `backfill_range` helper that bails on first failure to
avoid handing the worker a gap.
Restart semantics live at the binary boundary — the worker reloads from
sled, resumes from the last persisted block, and reconnects to bitcoind.
The storage-layer unit tests only re-check sled's own durability and
never exercise this path.

Drive the runner past a few blocks, stop it, mine more while it's down,
and assert it resumes from persisted state and catches up. The
resume-vs-replay discriminator currently grep-matches the genesis-bootstrap
log line; see the test's docstring for the rationale and a pointer to a
follow-up that should surface this through the status RPC instead.

Also expose the runner log path via the asm_rpc factory (needed by the
test) and guard CARGO_ARGS expansion in run_test.sh so the native-backend
path doesn't trip `set -u` when no extra cargo flags are set.
These exercised sled's own file-lock and durability across reopen, which
sled already covers. They were also flaky on Linux due to a sled 0.34
race on tempdir cleanup — papered over in the test body with explicit
`drop` and `flush` calls.

End-to-end persistence is now covered at the level we actually care
about (runner reload across stop/start) by the new asm-runner restart
test, so these unit tests carry their flakiness for no incremental
coverage.
@prajwolrg
Copy link
Copy Markdown
Collaborator Author

Moving this to draft for now.

This duplication hints at a deeper issue.

You're right. We should handle this more appropriately. Moving this to draft for now.

@prajwolrg prajwolrg marked this pull request as draft May 26, 2026 00:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants