chore: experiment — revert #2121 node fix to isolate CI flake cause (do not merge)#3373
Draft
barakeinav1 wants to merge 8 commits into
Draft
chore: experiment — revert #2121 node fix to isolate CI flake cause (do not merge)#3373barakeinav1 wants to merge 8 commits into
barakeinav1 wants to merge 8 commits into
Conversation
Adds `crates/contract/tests/sandbox/tee_back_migration.rs` with a single
test that drives `conclude_node_migration` against a stale destination
attestation and asserts the contract rejects with
`InvalidTeeRemoteAttestation`.
Setup mirrors the existing `reshare__should_evict_expired_attestations_via_post_reshare_sweep`
pattern: submit a `MockAttestation::WithConstraints { expiry_timestamp_seconds: now+5 }`,
then `worker.fast_forward()` past the expiry before calling conclude.
This proves the on-chain rejection path #2121 describes is reachable
without needing real TDX hardware or a 7-day wall-clock wait.
Scope: contract-side only. The node-side question — does
`periodic_attestation_submission` fire fast enough on the destination's
restart to close the race window before `retry_conclude_onboarding`
drains its budget — needs a running mpc-node binary and is tracked
separately.
Test passes in ~35s on `cargo test -p mpc-contract --test test --profile test-release`.
Extends `tee_back_migration.rs` with a companion test demonstrating the fix: if the destination submits a fresh attestation for the same TLS key before calling `conclude_node_migration`, the contract accepts the conclude. This is what a fixed node should do in `execute_onboarding` before entering `retry_conclude_onboarding`. Shared setup factored into `setup_stale_back_migration` so both tests diverge only at the final step (one calls conclude directly and expects rejection; the other submits fresh + conclude, expects success). Both tests pass in 11.77s total.
Back-migration (A -> B -> A) can leave the destination's stored on-chain attestation past expiry by the contract's `current_time_seconds` at conclude time. `reverify_participants` then rejects `conclude_node_migration` with `InvalidTeeRemoteAttestation`, and `retry_conclude_onboarding` loops forever on the same stale state. Adds `submit_attestation_before_concluding_migration` in `tee::remote_attestation`. `execute_onboarding` calls it between keyshare import and `retry_conclude_onboarding`; the helper generates a fresh quote via `TeeAuthority` and submits it via `submit_remote_attestation`, which waits for `TransactionStatus::Executed`. By the time conclude is attempted the contract holds a fresh attestation under the destination's TLS key. Helper failures are logged but non-fatal so the non-back- migration path is unaffected. Drives the recovery path pinned by the companion contract-sandbox test `conclude_node_migration__succeeds_when_destination_submits_fresh_attestation_before_conclude`. Plumbing: `TeeAuthority` and `account_public_key` are threaded `run.rs` -> `spawn_recovery_server_and_run_onboarding` -> `onboard` -> `execute_onboarding`.
`bogus_ed25519_public_key()` already returns `Ed25519PublicKey`; the trailing `.into()` is a no-op the older local clippy did not catch.
The new attestation-refresh path in execute_onboarding routes through submit_remote_attestation, which calls send_and_wait. The mock left that as unimplemented!(), so tests::onboarding::test_onboarding panicked when the refresh fired during back-migration onboarding. The mock now forwards the transaction through its existing channel and returns TransactionStatus::Executed, matching real production behaviour for the test's purposes.
When `wait_for_node_indexer_height_above` times out we have no visibility into the underlying cause from CI logs — the mpc-node binary writes to `home_dir/stderr.log` but nextest doesn't capture per-node logs. PR #3365 already added a "node may have exited" hint in the inner error; the outer context now dumps the tail of the relevant node's stderr.log so any panic/abort reason shows up directly in the test panic message. This is a diagnostic aid for chasing the residual #3366 crash window. Trade-off: when the wait genuinely just times out (rare under #3365's fix), the panic message gets ~16KB longer. Worth it. Adds `MpcNodeState::home_dir()` accessor so the test helper can locate the file without re-fetching it from each `match` arm.
The check-use-in-fn lint disallows `use` inside fn bodies; lift the imports added in d970722 out of `read_stderr_tail`.
barakeinav1
added a commit
that referenced
this pull request
May 28, 2026
Self-contained write-up suitable for filing as a nearcore GitHub issue once we decide to file it. Covers: - Both panic modes with full stack traces (streamer/mod.rs:207 and client_actor.rs:217) - Affected versions (2.12.0-rc.1, both aab31b0 and fadb5c1) - Step-by-step reproduction pointing at this PR's branch + the specific triggering commit (1bcbb43, the node-side fix in PR #2121) - Revert experiment (PR #3373) isolating the trigger to one commit - What we tried and ruled out: SIGKILL, SIGTERM, drain, settle-time - Explicit note about SIGTERM — mpc-node has no handler installed, so the experiment was effectively identical to SIGKILL; we cannot claim to know whether a proper graceful path would prevent the panic, but SIGKILL is a legitimate production scenario and shouldn't corrupt state unrecoverably regardless - Hypothesis about cross-block receipt tracking + RocksDB flush boundaries - Links to CI logs showing both panic modes inline Companion to `2121-back-migration-e2e-flake.md` (full experiment matrix) — that one is the lab notebook, this one is the bug report.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Not for merge. This is a CI experiment branch for PR #3362's flake investigation.
Same branch as #3362 (rebased on main, includes #3365's test fix, contract sandbox tests, stderr-tail diagnostic) but with one commit reverted: the node-side `submit_attestation_before_concluding_migration` change.
Hypothesis being tested
PR #3365 ran 15/15 green pre-merge. PR #3362 on top of merged main fails 5/6. The bug surfaces deterministically only with #3362's node-side fix on the branch. If the failure is caused by the specific TX pattern #3362 introduces (`SubmitParticipantInfo` immediately before `ConcludeNodeMigration`, same signer, adjacent blocks), reverting just that change should make E2E pass.
Possible outcomes
What gets closed after the experiment
This PR. The findings go back into PR #3362's diagnostic doc.