feat(worker): handle L1 reorgs in ASM block sync#139
Conversation
Codecov Report❌ Patch coverage is
... and 1 file with indirect coverage changes 🚀 New features to boost your workflow:
|
|
Commit: 7795a8a
|
Rename MmrDb to AsmManifestMmrDb and ManifestNodes to AsmManifestMmrNodeStore, and type the db's leaf API in AsmManifestHash instead of Buf32. The Buf32/[u8; 32] boundary now lives at the db layer, which deals only in leaves (manifest hashes). The node store stays at raw [u8; 32]: it holds every MMR node, and internal nodes are combined hashes, not manifest hashes — and MmrNodeStore's MerkleHash bound plus Sha256Hasher pin it to [u8; 32] regardless. This mirrors the ManifestMmrStore re-typing and lets the worker drop its now-redundant conversions.
Replace AsmManifestMmrDb::append_leaf with put_leaf(height, hash), which writes at an explicit leaf index instead of always appending at the end. The MMR is height-indexed, so the caller already knows the leaf index (it equals the L1 block height); taking it as an argument lets the db place the leaf directly rather than appending and having the worker verify the landed index afterward. The worker keeps the misalignment guard (leaf_count must equal height) since put_leaf would otherwise silently overwrite a past height, which the height-to-index alignment forbids.
Drop the misalignment guard from put_manifest_hash so a height below the current leaf count overwrites the existing leaf instead of erroring. Replacing a leaf is expected during an L1 reorg, where the block at an already-seen height is superseded. A gap past the end is still rejected by the underlying put_leaf, so the height-to-index mapping stays intact. Removes the now-unused ManifestMmrMisaligned error.
process_block walked back the chain to the pivot anchor while holding every full block in memory until the forward pass. A deep reorg spanning many blocks could OOM. The backward walk only needs each block's prev_blockhash, so fetch just the header and remember the unprocessed blocks by commitment. The forward pass then fetches each full block one height at a time, keeping only a single block resident. Adds L1DataProvider::get_l1_block_header for the header-only read.
process_block did the whole pivot-lookup-and-replay inline, and its name
implied a single block when it actually brings ASM state up to a target
that may span many blocks or, on a reorg, a different branch.
Rename it to sync_to_block(target) and split its two phases into helpers:
plan_block_processing walks back (reading only headers) to the base
anchor and returns a ProcessingPlan { base_state, base_block, pending };
apply_block runs the STF for one block and persists the results. The
named struct replaces a positional tuple at the call site.
Also documents the two phases, reorg handling, and the forward-order
invariant the manifest MMR relies on for leaf replacement.
The worker's own unit tests need the regtest-backed WorkerContext that until now lived only in the integration-tests harness. Move it into the worker crate as `test_utils`, gated behind the standard `test-utils` feature, so the worker's `cfg(test)` tests and the external integration crate share one implementation instead of letting two copies drift.
Add the STR-1928 service tests, driving the real planner and STF over a regtest node via TestAsmWorkerContext: - plan_block_processing: linear extension, already-processed target, reorg fork-point selection, missing-genesis floor - sync_to_block: pre-genesis no-op, linear advance, resync repositioning, reorg leaf overwrite-in-place, fetch-error propagation - apply_block: single-block manifest/anchor storage, idempotent re-run Also introduces the shared cfg(test) fixtures (setup_state, setup_context, mine, reorg) that the remaining worker unit tests build on.
Complete the STR-1928 service coverage: the SubmitBlock dispatch (Continue on success, ShouldExit plus the error relayed to the caller on failure) and get_status / logs. process_input completes commands via send_blocking, which panics in an async context, so the tests run it on a plain OS thread, mirroring the worker's dedicated thread. With the impl now tested, drop the resolved STR-1928 TODO.
Test AsmWorkerServiceState::new — genesis creation over an empty store, adopting a stored latest on restart, and idempotent MMR prefill — plus a transition smoke test, all on the shared TestAsmWorkerContext fixtures. With those fixtures covering the same ground, drop the bespoke MockWorkerContext (and its setup_env/get_l1_anchor duplicates) that the old state test carried. Factor genesis_params out of setup_state so the restart tests can re-run `new` on a cloned context.
The height-indexed manifest MMR relies on each manifest's leaf index equalling its L1 block height. Assert it right after the append so any drift (a missed sentinel prefill, a reorg off-by-one) trips at the transition instead of silently producing bad inclusion proofs later.
Exercise AuxDataResolver against TestAsmWorkerContext: empty requests, Bitcoin tx fetch and the not-found path, and manifest hash ranges whose proofs are verified against a parallel accumulator. Manifest proofs need a populated height-indexed MMR; Bitcoin tx fetch by txid needs a node with txindex, so add a txindex-enabled context fixture alongside the existing one.
…count A verifier checks the generated MMR proofs against the current state's manifest-accumulator root, so the snapshot leaf count must be that accumulator's own num_entries(). Record the rationale at the call site, where the choice would otherwise read as arbitrary.
2c58a52 to
bdf1cf4
Compare
Drives the full ASM STF over a chain then a shorter reorg branch, exercising the resolver against a real post-reorg accumulator rather than synthetic MMR population. Asserts the safety property: leaves orphaned by the reorg stay in storage (fetchable) but can no longer be proven against the shorter accumulator snapshot — a prover cannot prove a leaf the current chain does not commit to.
The worker is generic over `S: AsmSpec` and touches only the chain view (PoW header continuity + the manifest history accumulator), never any subprotocol. Testing it through the production `StrataAsmSpec` pulled the admin/checkpoint/bridge subprotocol crates into the worker's dev-deps and coupled its tests to their genesis configs and section layout — all for state the worker ignores, so unrelated subprotocol churn could break worker tests. Replace it with a local `TestAsmSpec`/`TestAsmParams` that runs zero subprotocols. The STF still validates headers and appends one manifest per block (distinct leaves, empty section/log payloads), which is exactly the surface the MMR/reorg logic exercises. Drops strata-asm-spec, strata-asm-params, and strata-test-utils-arb from dev-deps.
bdf1cf4 to
53dd5e6
Compare
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 53dd5e64d0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| // Drop pivot span guard before next phase | ||
| drop(pivot_span_guard); | ||
| state.update_anchor_state(base_state, base_block); |
There was a problem hiding this comment.
Persist rollbacks to already-stored anchors
When the submitted target is already present in the anchor-state store, such as a reorg that rolls the active tip back to a previously processed ancestor, pending is empty and this only updates the in-memory anchor. The durable latest pointer is only moved by store_anchor_state during apply_block, so a restart will load the old higher/stale branch from AsmWorkerServiceState::new instead of the rollback target. Please also update the stored latest anchor when the processing plan has no blocks to apply.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Good catch — fixed in 2810bc0. When the plan has no pending blocks (a resync, or a reorg rolling the tip back to an already-stored ancestor), sync_to_block now re-commits the base via store_anchor_state, moving the durable latest pointer so a restart resumes from the rollback target instead of reloading the stale higher branch. The write is an idempotent overwrite, same as a re-run apply_block. Added sync_resync_persists_latest_across_restart, which reloads AsmWorkerServiceState over the same store and asserts it resumes from the rollback target.
…lock A resync, or a reorg that rolls the active tip back to an already-stored ancestor, plans no pending blocks, so the forward pass never calls store_anchor_state. That left the in-memory anchor repositioned but the durable `latest` pointer still on the prior, possibly higher, branch — a restart would reload that stale branch instead of resuming from the rollback target. Re-commit the base as the latest anchor when the plan is empty; the write is an idempotent overwrite, same as a re-run apply_block.
Description
Make the ASM worker handle L1 reorgs correctly.
Before this, the manifest MMR was append-only. When a reorg replaced the block at an already-processed height, reprocessing re-appended past the end, the new leaf no longer landed at its height, and the worker bailed with
ManifestMmrMisaligned— and since that error is fatal, the worker shut down. A normal L1 reorg could take the worker offline.This PR positions manifest leaves by height and lets the forward pass overwrite the abandoned branch's leaves in place, so a reorg resyncs cleanly instead of erroring out.
Reorg handling (the core change)
put_leaf(height, hash)replacesappend_leaf. The caller knows the leaf index equals the L1 height, so it places the leaf directly. Aheightat the current end appends; aheightbelow it overwrites in place — the expected case when a reorg replaces the block at an already-seen height. A gap past the end is still rejected. Removes the now-unusedManifestMmrMisalignederror.Memory-bounded pivot lookup
L1DataProvider::get_l1_block_header) and remembers unprocessed blocks by commitment; the forward pass fetches one full block at a time. Previously a deep reorg held every full block in memory and could OOM.Sync structure (supporting the above)
process_blocktosync_to_block(target)and split it intoplan_block_processing(returns aProcessingPlan) andapply_block(runs the STF for one block and persists). The persist order — manifest and aux data first, anchor state last — is the crash-safety contract: a crash before the anchor write leaves the block uncommitted and safely re-runnable, since every write on the path is an idempotent, block-keyed overwrite.Test infrastructure & coverage
TestAsmWorkerContextout of the integration harness into the worker crate behind the standardtest-utilsfeature, so the worker'scfg(test)tests and the external integration crate share one implementation.plan_block_processing/sync_to_block/apply_block(linear, reorg, resync, orphaned-leaf, error paths),process_inputdispatch +get_status, andAsmWorkerServiceState::new(genesis, restart adoption, idempotent MMR prefill).MockWorkerContextnow that the shared fixtures cover the same ground, and drop the resolved STR-1928 TODO.TestAsmSpec/TestAsmParamsrather than the productionStrataAsmSpec. The worker is spec-agnostic (it only touches the chain view), so this decouples its tests from subprotocol genesis configs and section layout, and dropsstrata-asm-spec,strata-asm-params, andstrata-test-utils-arbfrom dev-deps.Type of Change
Related Issues
Jira: STR-3722