Skip to content

feat(worker): handle L1 reorgs in ASM block sync#139

Open
prajwolrg wants to merge 15 commits into
mainfrom
STR-3722-manifest-mmr-reorg
Open

feat(worker): handle L1 reorgs in ASM block sync#139
prajwolrg wants to merge 15 commits into
mainfrom
STR-3722-manifest-mmr-reorg

Conversation

@prajwolrg
Copy link
Copy Markdown
Collaborator

@prajwolrg prajwolrg commented Jun 7, 2026

Description

Make the ASM worker handle L1 reorgs correctly.

Before this, the manifest MMR was append-only. When a reorg replaced the block at an already-processed height, reprocessing re-appended past the end, the new leaf no longer landed at its height, and the worker bailed with ManifestMmrMisaligned — and since that error is fatal, the worker shut down. A normal L1 reorg could take the worker offline.

This PR positions manifest leaves by height and lets the forward pass overwrite the abandoned branch's leaves in place, so a reorg resyncs cleanly instead of erroring out.

Reorg handling (the core change)

  • Manifest MMR is now height-indexed: put_leaf(height, hash) replaces append_leaf. The caller knows the leaf index equals the L1 height, so it places the leaf directly. A height at the current end appends; a height below it overwrites in place — the expected case when a reorg replaces the block at an already-seen height. A gap past the end is still rejected. Removes the now-unused ManifestMmrMisaligned error.
  • Block sync walks the target's ancestry back to the base — the most recent ancestor with a stored anchor state. On a reorg that base is the fork point, so the abandoned branch is never revisited; the forward pass then overwrites each superseded leaf from the fork point forward before any later height is written, so a stale leaf never outlives the chain it belonged to.
  • STF now asserts the history accumulator's last-inserted height tracks the verified block height, catching any height/index drift at the source.

Memory-bounded pivot lookup

  • The backward walk to the base now fetches only block headers (new L1DataProvider::get_l1_block_header) and remembers unprocessed blocks by commitment; the forward pass fetches one full block at a time. Previously a deep reorg held every full block in memory and could OOM.

Sync structure (supporting the above)

  • Rename process_block to sync_to_block(target) and split it into plan_block_processing (returns a ProcessingPlan) and apply_block (runs the STF for one block and persists). The persist order — manifest and aux data first, anchor state last — is the crash-safety contract: a crash before the anchor write leaves the block uncommitted and safely re-runnable, since every write on the path is an idempotent, block-keyed overwrite.

Test infrastructure & coverage

  • Move the regtest-backed TestAsmWorkerContext out of the integration harness into the worker crate behind the standard test-utils feature, so the worker's cfg(test) tests and the external integration crate share one implementation.
  • Cover the service end to end, including the reorg paths: plan_block_processing / sync_to_block / apply_block (linear, reorg, resync, orphaned-leaf, error paths), process_input dispatch + get_status, and AsmWorkerServiceState::new (genesis, restart adoption, idempotent MMR prefill).
  • Retire the bespoke MockWorkerContext now that the shared fixtures cover the same ground, and drop the resolved STR-1928 TODO.
  • Drive the worker's own tests with a local no-subprotocol TestAsmSpec/TestAsmParams rather than the production StrataAsmSpec. The worker is spec-agnostic (it only touches the chain view), so this decouples its tests from subprotocol genesis configs and section layout, and drops strata-asm-spec, strata-asm-params, and strata-test-utils-arb from dev-deps.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature/Enhancement (non-breaking change which adds functionality or enhances an existing one)
  • Refactor
  • New or updated tests

Related Issues

Jira: STR-3722

@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 7, 2026

Codecov Report

❌ Patch coverage is 98.79195% with 9 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
crates/worker/src/test_utils.rs 97.48% 6 Missing ⚠️
bin/asm-runner/src/worker_context.rs 88.23% 2 Missing ⚠️
crates/worker/src/service.rs 99.67% 1 Missing ⚠️
Files with missing lines Coverage Δ
bin/asm-runner/src/storage.rs 100.00% <100.00%> (ø)
crates/stf/src/transition.rs 97.82% <100.00%> (+0.15%) ⬆️
crates/storage/src/mmr.rs 96.82% <100.00%> (+0.13%) ⬆️
crates/worker/src/aux_resolver.rs 93.93% <100.00%> (+10.06%) ⬆️
crates/worker/src/state.rs 100.00% <100.00%> (+24.86%) ⬆️
crates/worker/src/traits.rs 100.00% <ø> (ø)
crates/worker/src/service.rs 97.99% <99.67%> (+12.45%) ⬆️
bin/asm-runner/src/worker_context.rs 67.75% <88.23%> (+1.28%) ⬆️
crates/worker/src/test_utils.rs 97.48% <97.48%> (ø)

... and 1 file with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Jun 7, 2026

Commit: 7795a8a
SP1 Execution Results

program cycles gas
asm-stf 136,367,061 141,452,312
moho 5,223,555 5,525,314

@prajwolrg prajwolrg changed the title refactor(worker): height-indexed manifest MMR, phased block sync, and worker-service tests feat(worker): handle L1 reorgs in ASM block sync Jun 7, 2026
prajwolrg added 12 commits June 7, 2026 10:52
Rename MmrDb to AsmManifestMmrDb and ManifestNodes to
AsmManifestMmrNodeStore, and type the db's leaf API in AsmManifestHash
instead of Buf32.

The Buf32/[u8; 32] boundary now lives at the db layer, which deals only
in leaves (manifest hashes). The node store stays at raw [u8; 32]: it
holds every MMR node, and internal nodes are combined hashes, not
manifest hashes — and MmrNodeStore's MerkleHash bound plus Sha256Hasher
pin it to [u8; 32] regardless. This mirrors the ManifestMmrStore
re-typing and lets the worker drop its now-redundant conversions.
Replace AsmManifestMmrDb::append_leaf with put_leaf(height, hash), which
writes at an explicit leaf index instead of always appending at the end.

The MMR is height-indexed, so the caller already knows the leaf index (it
equals the L1 block height); taking it as an argument lets the db place
the leaf directly rather than appending and having the worker verify the
landed index afterward. The worker keeps the misalignment guard
(leaf_count must equal height) since put_leaf would otherwise silently
overwrite a past height, which the height-to-index alignment forbids.
Drop the misalignment guard from put_manifest_hash so a height below the
current leaf count overwrites the existing leaf instead of erroring.
Replacing a leaf is expected during an L1 reorg, where the block at an
already-seen height is superseded. A gap past the end is still rejected
by the underlying put_leaf, so the height-to-index mapping stays intact.

Removes the now-unused ManifestMmrMisaligned error.
process_block walked back the chain to the pivot anchor while holding
every full block in memory until the forward pass. A deep reorg spanning
many blocks could OOM.

The backward walk only needs each block's prev_blockhash, so fetch just
the header and remember the unprocessed blocks by commitment. The forward
pass then fetches each full block one height at a time, keeping only a
single block resident. Adds L1DataProvider::get_l1_block_header for the
header-only read.
process_block did the whole pivot-lookup-and-replay inline, and its name
implied a single block when it actually brings ASM state up to a target
that may span many blocks or, on a reorg, a different branch.

Rename it to sync_to_block(target) and split its two phases into helpers:
plan_block_processing walks back (reading only headers) to the base
anchor and returns a ProcessingPlan { base_state, base_block, pending };
apply_block runs the STF for one block and persists the results. The
named struct replaces a positional tuple at the call site.

Also documents the two phases, reorg handling, and the forward-order
invariant the manifest MMR relies on for leaf replacement.
The worker's own unit tests need the regtest-backed WorkerContext that
until now lived only in the integration-tests harness. Move it into the
worker crate as `test_utils`, gated behind the standard `test-utils`
feature, so the worker's `cfg(test)` tests and the external integration
crate share one implementation instead of letting two copies drift.
Add the STR-1928 service tests, driving the real planner and STF over a
regtest node via TestAsmWorkerContext:

- plan_block_processing: linear extension, already-processed target,
  reorg fork-point selection, missing-genesis floor
- sync_to_block: pre-genesis no-op, linear advance, resync repositioning,
  reorg leaf overwrite-in-place, fetch-error propagation
- apply_block: single-block manifest/anchor storage, idempotent re-run

Also introduces the shared cfg(test) fixtures (setup_state, setup_context,
mine, reorg) that the remaining worker unit tests build on.
Complete the STR-1928 service coverage: the SubmitBlock dispatch
(Continue on success, ShouldExit plus the error relayed to the caller on
failure) and get_status / logs. process_input completes commands via
send_blocking, which panics in an async context, so the tests run it on a
plain OS thread, mirroring the worker's dedicated thread.

With the impl now tested, drop the resolved STR-1928 TODO.
Test AsmWorkerServiceState::new — genesis creation over an empty store,
adopting a stored latest on restart, and idempotent MMR prefill — plus a
transition smoke test, all on the shared TestAsmWorkerContext fixtures.

With those fixtures covering the same ground, drop the bespoke
MockWorkerContext (and its setup_env/get_l1_anchor duplicates) that the
old state test carried. Factor genesis_params out of setup_state so the
restart tests can re-run `new` on a cloned context.
The height-indexed manifest MMR relies on each manifest's leaf index
equalling its L1 block height. Assert it right after the append so any
drift (a missed sentinel prefill, a reorg off-by-one) trips at the
transition instead of silently producing bad inclusion proofs later.
Exercise AuxDataResolver against TestAsmWorkerContext: empty requests,
Bitcoin tx fetch and the not-found path, and manifest hash ranges whose
proofs are verified against a parallel accumulator.

Manifest proofs need a populated height-indexed MMR; Bitcoin tx fetch by
txid needs a node with txindex, so add a txindex-enabled context fixture
alongside the existing one.
…count

A verifier checks the generated MMR proofs against the current state's manifest-accumulator root, so the snapshot leaf count must be that accumulator's own num_entries(). Record the rationale at the call site, where the choice would otherwise read as arbitrary.
@prajwolrg prajwolrg force-pushed the STR-3722-manifest-mmr-reorg branch from 2c58a52 to bdf1cf4 Compare June 7, 2026 05:08
prajwolrg added 2 commits June 7, 2026 10:53
Drives the full ASM STF over a chain then a shorter reorg branch, exercising the resolver against a real post-reorg accumulator rather than synthetic MMR population. Asserts the safety property: leaves orphaned by the reorg stay in storage (fetchable) but can no longer be proven against the shorter accumulator snapshot — a prover cannot prove a leaf the current chain does not commit to.
The worker is generic over `S: AsmSpec` and touches only the chain view
(PoW header continuity + the manifest history accumulator), never any
subprotocol. Testing it through the production `StrataAsmSpec` pulled the
admin/checkpoint/bridge subprotocol crates into the worker's dev-deps and
coupled its tests to their genesis configs and section layout — all for
state the worker ignores, so unrelated subprotocol churn could break
worker tests.

Replace it with a local `TestAsmSpec`/`TestAsmParams` that runs zero
subprotocols. The STF still validates headers and appends one manifest
per block (distinct leaves, empty section/log payloads), which is exactly
the surface the MMR/reorg logic exercises. Drops strata-asm-spec,
strata-asm-params, and strata-test-utils-arb from dev-deps.
@prajwolrg prajwolrg force-pushed the STR-3722-manifest-mmr-reorg branch from bdf1cf4 to 53dd5e6 Compare June 7, 2026 05:08
@prajwolrg prajwolrg marked this pull request as ready for review June 7, 2026 05:15
@prajwolrg prajwolrg requested review from evgenyzdanovich and irnb June 7, 2026 05:23
@prajwolrg prajwolrg self-assigned this Jun 7, 2026
@prajwolrg
Copy link
Copy Markdown
Collaborator Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 53dd5e64d0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".


// Drop pivot span guard before next phase
drop(pivot_span_guard);
state.update_anchor_state(base_state, base_block);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Persist rollbacks to already-stored anchors

When the submitted target is already present in the anchor-state store, such as a reorg that rolls the active tip back to a previously processed ancestor, pending is empty and this only updates the in-memory anchor. The durable latest pointer is only moved by store_anchor_state during apply_block, so a restart will load the old higher/stale branch from AsmWorkerServiceState::new instead of the rollback target. Please also update the stored latest anchor when the processing plan has no blocks to apply.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — fixed in 2810bc0. When the plan has no pending blocks (a resync, or a reorg rolling the tip back to an already-stored ancestor), sync_to_block now re-commits the base via store_anchor_state, moving the durable latest pointer so a restart resumes from the rollback target instead of reloading the stale higher branch. The write is an idempotent overwrite, same as a re-run apply_block. Added sync_resync_persists_latest_across_restart, which reloads AsmWorkerServiceState over the same store and asserts it resumes from the rollback target.

…lock

A resync, or a reorg that rolls the active tip back to an already-stored
ancestor, plans no pending blocks, so the forward pass never calls
store_anchor_state. That left the in-memory anchor repositioned but the
durable `latest` pointer still on the prior, possibly higher, branch — a
restart would reload that stale branch instead of resuming from the
rollback target. Re-commit the base as the latest anchor when the plan is
empty; the write is an idempotent overwrite, same as a re-run apply_block.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant