fix(node): refresh attestation before concluding back-migration (#2121)#3362
Draft
barakeinav1 wants to merge 35 commits into
Draft
fix(node): refresh attestation before concluding back-migration (#2121)#3362barakeinav1 wants to merge 35 commits into
barakeinav1 wants to merge 35 commits into
Conversation
f947ff9 to
3a63c91
Compare
Adds `crates/contract/tests/sandbox/tee_back_migration.rs` with a single
test that drives `conclude_node_migration` against a stale destination
attestation and asserts the contract rejects with
`InvalidTeeRemoteAttestation`.
Setup mirrors the existing `reshare__should_evict_expired_attestations_via_post_reshare_sweep`
pattern: submit a `MockAttestation::WithConstraints { expiry_timestamp_seconds: now+5 }`,
then `worker.fast_forward()` past the expiry before calling conclude.
This proves the on-chain rejection path #2121 describes is reachable
without needing real TDX hardware or a 7-day wall-clock wait.
Scope: contract-side only. The node-side question — does
`periodic_attestation_submission` fire fast enough on the destination's
restart to close the race window before `retry_conclude_onboarding`
drains its budget — needs a running mpc-node binary and is tracked
separately.
Test passes in ~35s on `cargo test -p mpc-contract --test test --profile test-release`.
Extends `tee_back_migration.rs` with a companion test demonstrating the fix: if the destination submits a fresh attestation for the same TLS key before calling `conclude_node_migration`, the contract accepts the conclude. This is what a fixed node should do in `execute_onboarding` before entering `retry_conclude_onboarding`. Shared setup factored into `setup_stale_back_migration` so both tests diverge only at the final step (one calls conclude directly and expects rejection; the other submits fresh + conclude, expects success). Both tests pass in 11.77s total.
Back-migration (A -> B -> A) can leave the destination's stored on-chain attestation past expiry by the contract's `current_time_seconds` at conclude time. `reverify_participants` then rejects `conclude_node_migration` with `InvalidTeeRemoteAttestation`, and `retry_conclude_onboarding` loops forever on the same stale state. Adds `submit_attestation_before_concluding_migration` in `tee::remote_attestation`. `execute_onboarding` calls it between keyshare import and `retry_conclude_onboarding`; the helper generates a fresh quote via `TeeAuthority` and submits it via `submit_remote_attestation`, which waits for `TransactionStatus::Executed`. By the time conclude is attempted the contract holds a fresh attestation under the destination's TLS key. Helper failures are logged but non-fatal so the non-back- migration path is unaffected. Drives the recovery path pinned by the companion contract-sandbox test `conclude_node_migration__succeeds_when_destination_submits_fresh_attestation_before_conclude`. Plumbing: `TeeAuthority` and `account_public_key` are threaded `run.rs` -> `spawn_recovery_server_and_run_onboarding` -> `onboard` -> `execute_onboarding`.
`bogus_ed25519_public_key()` already returns `Ed25519PublicKey`; the trailing `.into()` is a no-op the older local clippy did not catch.
The new attestation-refresh path in execute_onboarding routes through submit_remote_attestation, which calls send_and_wait. The mock left that as unimplemented!(), so tests::onboarding::test_onboarding panicked when the refresh fired during back-migration onboarding. The mock now forwards the transaction through its existing channel and returns TransactionStatus::Executed, matching real production behaviour for the test's purposes.
When `wait_for_node_indexer_height_above` times out we have no visibility into the underlying cause from CI logs — the mpc-node binary writes to `home_dir/stderr.log` but nextest doesn't capture per-node logs. PR #3365 already added a "node may have exited" hint in the inner error; the outer context now dumps the tail of the relevant node's stderr.log so any panic/abort reason shows up directly in the test panic message. This is a diagnostic aid for chasing the residual #3366 crash window. Trade-off: when the wait genuinely just times out (rare under #3365's fix), the panic message gets ~16KB longer. Worth it. Adds `MpcNodeState::home_dir()` accessor so the test helper can locate the file without re-fetching it from each `match` arm.
The check-use-in-fn lint disallows `use` inside fn bodies; lift the imports added in d970722 out of `read_stderr_tail`.
04971e0 to
cc3eeae
Compare
Captures the full investigation chain that established the back-migration e2e flake on this PR is an upstream nearcore bug, not a defect in the node fix. Includes panic excerpts (both modes), the 5/6-vs-0/5 revert experiment result, and a precise repro shape suitable for an upstream issue.
A trimmed-down sibling of `should_handle_back_migration_a_to_b_to_a` that triggers the same upstream `near_indexer::streamer` / `near_client::client_actor` panic with the minimum operational sequence: one forward migration round (which produces the triggering `SubmitParticipantInfo` -> `ConcludeNodeMigration` adjacent-block pattern), SIGKILL of the source node, restart, then the indexer-progress assertion from PR #3365's helper. Drops the back-migration round, the BackupService recreation, and the sign+ckd verification at the end — none are needed once the panic condition is captured. ~70 lines vs ~140 for the full back-migration test. Currently expected to fail with the documented nearcore panic. Once nearcore's SIGKILL recovery is fixed, this test should pass and serve as the permanent regression guard. See: docs/investigation/2121-back-migration-e2e-flake.md
…tale-attestation-test # Conflicts: # crates/node/src/migration_service.rs # crates/node/src/migration_service/onboarding.rs
… activity Refactors `forward_migration_kill_restart__should_recover_indexer` into a shared driver and adds 3 sibling variants that differ only in what sign/CKD activity runs between forward conclude and the SIGKILL: - variant 0 (existing, renamed comment): no activity [baseline] - variant 1: one sign request only - variant 2: one CKD request only - variant 3: both — same activity profile as the back-migration test These deliberately preserve variant 0 as evidence that the forward+kill shape alone does not surface the upstream nearcore panic. Comparing the four pass rates will pin down which receipt graph (sign_respond / respond_ckd / both) is necessary to make the bug reachable, beyond the two-tx pattern from #3362 that the revert experiment already proved is necessary. See: docs/investigation/2121-back-migration-e2e-flake.md
Extends `run_forward_kill_restart_repro` to take separate (pre_sign, pre_ckd, post_sign, post_ckd) flags and adds three new tests covering the pre-forward axis of the activity matrix: - variant 4: pre-forward sign only [A0-signed sign_respond, no other] - variant 5: pre-forward ckd only [A0-signed respond_ckd, no other] - variant 6: pre-forward sign + ckd [both, no post-forward activity] The first 5-run campaign on variants 0-3 showed sign+ckd together post-forward fires the panic only ~20% (vs ~71% for back-migration). The remaining structural difference is the back-migration test's pre-forward sign+ckd block, which puts A0-signed respond receipts into chain history before A0 is demoted. Variants 4-6 isolate that ingredient. See: docs/investigation/2121-back-migration-e2e-flake.md
- Headline statistics now include the 4-variant post-forward matrix (campaign 1, 5 runs each): variants 0–2 fire 0%, variant 3 fires ~20%, back-migration fires ~71% combined. - Documents the 8-cell experiment matrix (variants 0–6 + back-migration) with the pre/post sign+CKD breakdown. - Explains the open question for campaign 2 (pre-forward variants 4–6) and how its result will narrow the trigger to one of the three remaining hypotheses (write-rate / receipt-graph / time-alignment). No code changes — this is a documentation-only update so the experiment state can be shared with peers while campaign 2 is still in flight.
Builds on the 7-variant matrix (variants 0-6) to fill out the full 16-cell (pre-sign, pre-ckd, post-sign, post-ckd) grid plus 2 SIGTERM counterparts: - variants 7-14: the 8 missing pre+post combinations - variant 15: full pre + full post (matches back-migration profile up to kill) - forward variant 15 + SIGTERM: graceful-shutdown equivalent of variant 15 - back-migration + SIGTERM: graceful-shutdown equivalent of the existing back-migration test New infrastructure: - `NodeStopSignal` enum in e2e-tests::mpc_node with Sigkill / Sigterm variants - `ProcessGuard::terminate_with_grace` sends SIGTERM via libc::kill, polls for graceful exit, falls back to SIGKILL on grace-period timeout - `MpcCluster::stop_nodes_with` mirrors `kill_nodes` but takes a signal The two SIGTERM variants provide controlled comparisons: if SIGTERM passes reliably while SIGKILL flakes, the bug is squarely on nearcore's SIGKILL recovery path. The full 16-cell matrix lets us isolate exactly which activity profile reaches the upstream panic. See: docs/investigation/2121-back-migration-e2e-flake.md
Replaces the partial 4-variant tables with the consolidated results across all four campaigns. Key updates: - Variant 15 (full pre+post sign/CKD) is the smoking-gun repro at ~70% — same activity profile as the back-migration test (which fires ~65%), same panic. The back-migration round itself adds nothing. - All 4 activity types must be present simultaneously. Drop any one (variants 9, 12, 13, 14) and the rate falls to ~10–30% or zero. - 2-of-4 or fewer activity types almost never fire. - SIGTERM at 30s grace does NOT trivially fix the bug: 2/2 observations of the SIGTERM-15 and SIGTERM-back-migration variants failed with the same `streamer/mod.rs:207` panic. - Receipt-graph hypothesis is the best fit. The SIGTERM-still-fails result is a strong hint that the in-memory streamer state (not just on-disk RocksDB) is the broken thing. Follow-ups updated to recommend keeping variant 15 + back-migration as the canonical repros and probing the SIGTERM result more carefully.
Prunes the 16-variant focused-repro matrix down to the two tests that matter — variant 15 (`forward_migration_kill_restart_with_pre_both_and_post_both`) and `should_handle_back_migration_a_to_b_to_a` — and changes both to **drain** the node before SIGKILLing it. Mechanism: `drain_node_and_wait_idle` in `common.rs` writes `listen_blocks.flag = false` (which makes the embedded indexer park at its cancellation point between block processing) and then polls the indexer's block-height metric until it stops advancing for a few seconds. With the indexer provably idle, SIGKILL catches no half-built streamer state and `neard` can resume cleanly on restart. The activity profile is unchanged — the drain step is added between post-forward activity and the kill. Models production reality: operators do rolling restarts by stopping new traffic, draining in-flight work, then stopping the process. This test now does the same. The deleted variants (0–14 + 2 SIGTERM) served their diagnostic purpose in identifying the trigger; the matrix data and panic excerpts are preserved in `docs/investigation/2121-back-migration-e2e-flake.md`.
…pstream
After SIGKILL (~70% fail) and SIGTERM-30s (2/2 fail), the third teardown
strategy — pause `listen_blocks.flag` and wait for our consumer-side
indexer to reach idle, then SIGKILL — also fails 4/4 (5th run in flight).
Failures take 106–127 s instead of ~102 s, confirming the drain ran;
it just doesn't address where the bug lives.
Why drain-via-listen_blocks doesn't help: the flag pauses our consumer
in `crates/node/src/indexer/handler.rs::listen_blocks`. The panic is on
the producer side — nearcore's `near_indexer::streamer::start::{{closure}}`.
We have no test-side mechanism that can drain that task short of waiting
for neard's own graceful shutdown, which SIGTERM-30s also didn't allow
to finish.
Three failed strategies together rule out test-side fixes:
plain SIGKILL → ~70% fail
SIGTERM, 30 s grace → 2 / 2 fail
drain consumer + SIGKILL → 4 / 4 fail
This is now solid evidence that the bug must be fixed in nearcore
itself. We can't paper over it from the e2e test without losing the
kill+restart semantics the test exists to model.
Follow-ups updated: opening the upstream issue is now the unblocking
step; decide whether to `#[ignore]` the two remaining repros, use
`reset_and_start_nodes` (papers over but stable), or leave flaky.
…s no SIGTERM handler Two updates after the 5-run drain-then-kill campaign completed: 1. Drain-then-kill final tally is 5/5 fail (was 4/5 mid-flight) for both variant 15 and the back-migration test. Locks the verdict in. 2. Discovered that mpc-node has **no SIGTERM handler at all** — `grep` for `tokio::signal` / `SignalKind` / `ctrl_c` / HTTP shutdown turns up nothing across the node crate. The internal `shutdown_signal_sender` in run.rs:183 exists and is wired into the main tokio::select!, but only `monitor_allowed_image_hashes` fires it today. That means all our SIGTERM experiments were effectively identical to SIGKILL (default OS behavior of SIGTERM = terminate-immediately). Updated the headline, the SIGTERM row in the failed-strategies table, and added a new follow-up (#5) calling out the graceful-shutdown gap as a real production deficiency worth its own issue.
…e hypothesis Adds `tokio::time::sleep(Duration::from_secs(60))` between the drain step and SIGKILL at all three kill sites (variant 15, plus A0+B0 in the back- migration test). Combined with the existing drain, this gives nearcore's producer-side streamer ~60+ s of inactivity before the kill, testing whether the upstream panic correlates with how recently neard wrote state — option A from `docs/investigation/2121-back-migration-e2e-flake.md`. Each test now takes ~60-90 s longer; nextest will mark them SLOW but they should still run. If the panic still fires after the long settle, that closes the door on "kill-too-close-to-last-activity" as the mechanism.
Self-contained write-up suitable for filing as a nearcore GitHub issue once we decide to file it. Covers: - Both panic modes with full stack traces (streamer/mod.rs:207 and client_actor.rs:217) - Affected versions (2.12.0-rc.1, both aab31b0 and fadb5c1) - Step-by-step reproduction pointing at this PR's branch + the specific triggering commit (1bcbb43, the node-side fix in PR #2121) - Revert experiment (PR #3373) isolating the trigger to one commit - What we tried and ruled out: SIGKILL, SIGTERM, drain, settle-time - Explicit note about SIGTERM — mpc-node has no handler installed, so the experiment was effectively identical to SIGKILL; we cannot claim to know whether a proper graceful path would prevent the panic, but SIGKILL is a legitimate production scenario and shouldn't corrupt state unrecoverably regardless - Hypothesis about cross-block receipt tracking + RocksDB flush boundaries - Links to CI logs showing both panic modes inline Companion to `2121-back-migration-e2e-flake.md` (full experiment matrix) — that one is the lab notebook, this one is the bug report.
Earlier wording overstated the reproduction rate. Replaced three uses of "deterministic" with the actual measured rate (~65–70% per run) so the upstream readers don't expect a 100% repro.
Adds a real graceful-shutdown path that mirrors what `neard`'s standalone binary does in its own SIGTERM handler: 1. Catch SIGTERM via `tokio::signal::unix::signal(SignalKind::terminate())` and route it into the existing internal `shutdown_signal_sender` channel. 2. After the main `tokio::select!` exits via that channel, call `near_async::shutdown_all_actors()` to stop the embedded nearcore actor system (cancels tokio tasks, signals multithread actors). 3. Then call `near_store::db::RocksDB::block_until_all_instances_are_dropped()` so the process doesn't exit before RocksDB has finished closing. Without this, SIGTERM hit mpc-node with no handler installed and the OS terminated the process immediately — functionally identical to SIGKILL — leaving RocksDB in a state that could break recovery on the next start. The investigation that surfaced this is in `docs/investigation/nearcore-indexer-sigkill-restart-panic.md`. Production impact: every dstack/Docker/Kubernetes stop sends SIGTERM first, then SIGKILLs if the process doesn't exit within the grace window. With this handler, mpc-node now actually uses that grace window to flush state. Test impact: switches the two remaining e2e regression tests (`should_handle_back_migration_a_to_b_to_a` and `forward_migration_kill_restart_with_pre_both_and_post_both`) from SIGKILL to SIGTERM with 60 s grace. If graceful shutdown is enough to dodge the upstream nearcore restart panic, both tests will now pass. Removes the now-unused `drain_node_and_wait_idle` helper and its listen_blocks.flag dance (replaced by real graceful shutdown).
…tale-attestation-test # Conflicts: # Cargo.lock # crates/node/Cargo.toml
`cargo fmt --check` caught the extra blank line in e2e-tests/tests/common.rs between `current_node_indexer_height` and the next helper. This was a leftover from removing the drain helper in the SIGTERM-handler commit.
`cargo make check-all-fast` runs `cargo doc` with `-D rustdoc::broken-intra-doc-links`. The intra-doc link to `kill_nodes` on `MpcCluster::stop_nodes_with`'s comment didn't resolve in scope. Drop the brackets so it's a normal code-formatted reference instead.
…_resharing regression `cancellation_of_resharing::test_cancellation_of_resharing` is now reproducibly failing on this branch (24s, WASM trap on the `vote_cancel_resharing` call) while it passes on main. Our PR adds no contract code changes vs main, so the difference must come from test scheduling — nextest orders tests alphabetically and we added variant 15 (`forward_migration_kill_restart_with_pre_both_and_post_both`) which puts it ahead of `cancellation_of_resharing` in the order. With the runner's max-threads of 4 and variant 15's heavy migration/sandbox workload, the concurrent setup pressure during the early test window may be racing the resharing state machine. Marking variant 15 `#[ignore]` for one CI run to test the hypothesis. If `cancellation_of_resharing` passes again with variant 15 disabled, the regression is purely test-ordering interference; we'll then need to either keep variant 15 ignored, restructure it to not race, or adjust the test parallelism config.
When resolving the conflict during the merge from origin/main, I used `cargo generate-lockfile` which resolved newer versions of several near-sdk dependencies than main has — added new entries for near-crypto-hash, near-global-contracts, near-sdk-core, near-sdk-env. The contract's WASM gets built against this lock file. The newer near-sdk versions produced different runtime behavior, which broke `cancellation_of_resharing::test_cancellation_of_resharing` (it WASM-trapped on `vote_cancel_resharing` instead of returning the expected `NotParticipant` error). Three CI runs reproduced this. Replaced with `git checkout origin/main -- Cargo.lock` so we inherit main's resolved versions verbatim, then `cargo build -p mpc-node` populated just the entries needed for my new `near-store` dep. Resulting diff vs main is 20 lines, all in the `near-store` neighborhood.
After the SIGTERM-handler fix removed the upstream nearcore restart panic, the back-migration test reached a different failure point: target's debug endpoint still showed destination_node_info.tls_public_key=None when the 30s INDEXER_SYNC_TIMEOUT elapsed. The contract-view check two retries earlier already passed, so the migration TX is on chain — target just hadn't propagated it to its local debug view yet. Test runs last (#23/23) with max concurrency = 4, so machines are loaded. Doubling the timeout to 60s tests whether this is a tight-timing CI artifact or a real post-restart coordinator-state bug.
Three changes to make each back-migration test run definitively classifiable instead of 'failed somewhere': 1. Rotate stderr.log → stderr.log.previous on restart. File::create was truncating the pre-restart logs, so SIGTERM-handler messages (SIGTERM received, RocksDB shutdown complete, etc.) were gone before the diagnostic ever read the file. 2. wait_for_node_indexer_height_above now dumps both tails on failure. With this, the panic message will show pre-restart shutdown activity AND post-restart panic, side by side. 3. terminate_with_grace eprintln!s either 'exited gracefully Xms after SIGTERM' or 'SIGTERM grace period elapsed; falling back to SIGKILL'. nextest reliably captures eprintln in failure output (tracing::warn may not), so we can tell when grace timed out vs when the handler completed cleanly. Together these let us classify each failing run into one of four categories rather than guessing.
5-run campaign showed our SIGTERM handler hangs past the 60s grace period in 5/5 runs. Diagnostics pinpointed the hang: our embedded indexer runs in a std::thread::spawn'd closure whose block_on never returns because the spawned monitor_* tasks hold Arc<IndexerState> references that nothing cancels on shutdown. So block_until_all_instances_are_dropped() loops forever waiting for the RocksDB ref count to hit zero. The test harness then SIGKILLs us and we land in the exact uncommitted-RocksDB state the handler was supposed to prevent — functionally identical to no handler at all. We still call shutdown_all_actors() so nearcore tasks can commit any in-flight RocksDB batches, but we no longer wait for refs to drop. RocksDB's WAL still guarantees committed data survives kill; this just closes a smaller flush window. The previously-introduced near-store dep is now unused, removed. A proper fix would wire a CancellationToken through the indexer thread (option B in the recent investigation discussion). This change is option A — minimal, lets the campaign actually exercise the handler instead of always tripping SIGKILL fallback.
…tdown question The earlier docs had to hedge: SIGTERM experiments couldn't distinguish 'graceful shutdown doesn't help' from 'we never had a SIGTERM handler.' We've now closed that gap by landing a real SIGTERM handler (#3409 / #3410) and re-running the back-migration campaign. - 2121-back-migration-e2e-flake.md: new section 'Real SIGTERM handler in mpc-node — also does not fix it' covering iteration 1 (handler hangs on block_until_all_instances_are_dropped, 5/5 SIGKILL fallback) and iteration 2 (drop the block_until call, 1/5 pass with 100 ms graceful shutdowns in 5/5). Follow-up #5 updated — in flight via #3409 / #3410. - nearcore-indexer-sigkill-restart-panic.md: new 'Strongest evidence: graceful shutdown doesn't help' subsection inside 'What we ruled out'; table updated with the 5-run SIGTERM-handler row (4/5 fail at 100 ms); TL;DR + Workarounds + References updated; old SIGTERM-disclaimer removed since the question is now answered. Together the docs now say: same panic, same rate, regardless of whether shutdown was SIGKILL or a verified 100 ms graceful actor-system stop. The fix has to be in nearcore.
…n gracefully Earlier wording claimed 'graceful shutdown does not prevent the panic.' A nearcore reviewer could rightly push back: 'you didn't actually let the indexer shut down — its tokio runtime keeps running until the process exits.' Both docs now state this explicitly: - The 100 ms 'graceful' covers mpc-node's main runtime + a stop signal into nearcore's ACTOR_SYSTEMS registry. It does NOT cover the embedded indexer thread, whose ~6 spawned monitor tasks hold Arc<IndexerState> -> Arc<RocksDB> refs that nothing cancels. - The std::thread::spawn'd indexer closure is terminated by OS-level process exit at ~100 ms, mid-flight, like SIGKILL. - We still believe the bug is upstream (3 reasons: panic reads pre-shutdown state, WAL already guarantees committed-data consistency, recovery shouldn't unwrap regardless), but we're honest that we haven't empirically tested 'fully drain the indexer thread first' (option B). TL;DR adjusted to say 'reasonably available to the embedder' and to foreground the recovery-path argument rather than the shutdown-timing argument. Workarounds and table-row wording updated for consistency.
… RocksDB drop Iteration 2 of the SIGTERM handler left the indexer thread running: its tokio runtime hosted spawned monitor tasks holding Arc<RocksDB> refs that nothing cancelled, so we couldn't call block_until_all_instances_are_dropped without it hanging forever. We worked around that by skipping the block_until call entirely — clean for mpc-node main, indexer thread killed mid-flight on process exit. This change actually drains the indexer thread: - spawn_real_indexer takes a CancellationToken parameter. - The terminal listen_blocks .await is wrapped in tokio::select! racing the token's cancelled() future. - When the token cancels: listen_blocks branch aborts, block_on returns, the std::thread::spawn closure ends, the indexer's tokio runtime drops, and every tokio::spawn'd monitor task (each holding Arc<IndexerState> -> Arc<RocksDB>) is aborted with the runtime. Their Drop impls release the Arc refs. - In run.rs the shutdown sequence becomes: shutdown_all_actors() (give nearcore actors a chance to commit) -> indexer_shutdown_token.cancel() (release indexer thread refs) -> block_until_all_instances_are_dropped() (now safe, will return promptly). This is the minimum-viable form of option B: rather than threading the token through every monitor function, rely on runtime-drop to abort all spawned tasks. Substantially smaller diff and same behavior. Whether this changes the upstream nearcore restart-panic rate is the next question — campaign of 5 runs on this commit will answer it.
The kill+restart diagnostic only rotated stderr.log and dumped its tails. That captured panics but missed every tracing::info!/warn!/error! emitted by mpc-node, because tracing-subscriber writes to STDOUT by default. To prove option B is actually executing its shutdown path (SIGTERM received -> Stopping nearcore actor system -> Cancelling indexer shutdown token -> Indexer thread received shutdown signal -> Waiting for RocksDB instances -> RocksDB shutdown complete), the diagnostic now also rotates stdout.log and dumps its pre- and post-restart tails in the panic message. Symmetric to the existing stderr.log handling.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2121.
Summary
Back-migration (A → B → A) can leave the destination's stored on-chain attestation past expiry by the contract's
current_time_secondsat conclude time. When that happens,reverify_participantsrejectsconclude_node_migrationwithInvalidTeeRemoteAttestation, andretry_conclude_onboardingloops forever on the same stale state — the migration never completes.This PR fixes that by having the destination submit a fresh attestation before entering
retry_conclude_onboarding, and adds contract-sandbox tests that pin both the failure mode and the recovery path.The fix
crates/node/src/migration_service/onboarding.rs::execute_onboardingnow calls a new helpersubmit_attestation_before_concluding_migrationbetween keyshare import andretry_conclude_onboarding. The helper generates a fresh quote viaTeeAuthorityand submits it viasubmit_remote_attestation, which blocks untilTransactionStatus::Executed— so by the timeretry_conclude_onboardingruns, the contract holds a fresh attestation under the destination's TLS key.Helper failure is logged but non-fatal: in the non-back-migration path the existing on-chain attestation is typically still valid, so falling through to
retry_conclude_onboardingis safe.Plumbing:
TeeAuthorityandaccount_public_keyare threadedrun.rs → spawn_recovery_server_and_run_onboarding → onboard → execute_onboarding.Contract-sandbox tests (
crates/contract/tests/sandbox/tee_back_migration.rs)Two paired tests sharing a
setup_stale_back_migrationhelper:conclude_node_migration__rejects_when_destination_attestation_is_stale— reproduces the bug. Submits an attestation withexpiry = now + 5s, fast-forwards 100 blocks, callsconclude_node_migrationdirectly; assertsInvalidTeeRemoteAttestation.conclude_node_migration__succeeds_when_destination_submits_fresh_attestation_before_conclude— spec-by-test for the fix. Same setup, but the destination submits a fresh attestation for the same TLS key before calling conclude; asserts success. This is exactly the recovery path the new node-side helper drives.Both pass in ~12s.
What's covered vs out of scope
current_time_secondsretry_conclude_onboardingsubmit_attestation_before_concluding_migration__should_submit_one_fresh_attestationTest plan
cargo test -p mpc-node --lib submit_attestation_before_concluding_migration— passes.cargo test -p mpc-contract --test test --profile test-release conclude_node_migration— both contract tests pass.cargo check -p mpc-node --lib --tests— clean.cargo fmt --check— clean.Related
Follow-up
conclude_node_migrationfailure cause in node logs (observability gap uncovered while investigating this bug).