chore: experiment — revert #2121 node fix to isolate CI flake cause (do not merge) by barakeinav1 · Pull Request #3373 · near/mpc

barakeinav1 · 2026-05-27T10:16:24Z

Not for merge. This is a CI experiment branch for PR #3362's flake investigation.

Same branch as #3362 (rebased on main, includes #3365's test fix, contract sandbox tests, stderr-tail diagnostic) but with one commit reverted: the node-side `submit_attestation_before_concluding_migration` change.

Hypothesis being tested

PR #3365 ran 15/15 green pre-merge. PR #3362 on top of merged main fails 5/6. The bug surfaces deterministically only with #3362's node-side fix on the branch. If the failure is caused by the specific TX pattern #3362 introduces (`SubmitParticipantInfo` immediately before `ConcludeNodeMigration`, same signer, adjacent blocks), reverting just that change should make E2E pass.

Possible outcomes

E2E passes consistently → the specific TX pattern is the trigger. Use that as repro material for the upstream nearcore issue.
E2E still fails → the extra-block-depth hypothesis is also wrong; trigger is something else (e.g. test infrastructure interaction with my stderr-dump on failure, or pure timing).

What gets closed after the experiment

This PR. The findings go back into PR #3362's diagnostic doc.

Adds `crates/contract/tests/sandbox/tee_back_migration.rs` with a single test that drives `conclude_node_migration` against a stale destination attestation and asserts the contract rejects with `InvalidTeeRemoteAttestation`. Setup mirrors the existing `reshare__should_evict_expired_attestations_via_post_reshare_sweep` pattern: submit a `MockAttestation::WithConstraints { expiry_timestamp_seconds: now+5 }`, then `worker.fast_forward()` past the expiry before calling conclude. This proves the on-chain rejection path #2121 describes is reachable without needing real TDX hardware or a 7-day wall-clock wait. Scope: contract-side only. The node-side question — does `periodic_attestation_submission` fire fast enough on the destination's restart to close the race window before `retry_conclude_onboarding` drains its budget — needs a running mpc-node binary and is tracked separately. Test passes in ~35s on `cargo test -p mpc-contract --test test --profile test-release`.

Extends `tee_back_migration.rs` with a companion test demonstrating the fix: if the destination submits a fresh attestation for the same TLS key before calling `conclude_node_migration`, the contract accepts the conclude. This is what a fixed node should do in `execute_onboarding` before entering `retry_conclude_onboarding`. Shared setup factored into `setup_stale_back_migration` so both tests diverge only at the final step (one calls conclude directly and expects rejection; the other submits fresh + conclude, expects success). Both tests pass in 11.77s total.

Back-migration (A -> B -> A) can leave the destination's stored on-chain attestation past expiry by the contract's `current_time_seconds` at conclude time. `reverify_participants` then rejects `conclude_node_migration` with `InvalidTeeRemoteAttestation`, and `retry_conclude_onboarding` loops forever on the same stale state. Adds `submit_attestation_before_concluding_migration` in `tee::remote_attestation`. `execute_onboarding` calls it between keyshare import and `retry_conclude_onboarding`; the helper generates a fresh quote via `TeeAuthority` and submits it via `submit_remote_attestation`, which waits for `TransactionStatus::Executed`. By the time conclude is attempted the contract holds a fresh attestation under the destination's TLS key. Helper failures are logged but non-fatal so the non-back- migration path is unaffected. Drives the recovery path pinned by the companion contract-sandbox test `conclude_node_migration__succeeds_when_destination_submits_fresh_attestation_before_conclude`. Plumbing: `TeeAuthority` and `account_public_key` are threaded `run.rs` -> `spawn_recovery_server_and_run_onboarding` -> `onboard` -> `execute_onboarding`.

`bogus_ed25519_public_key()` already returns `Ed25519PublicKey`; the trailing `.into()` is a no-op the older local clippy did not catch.

The new attestation-refresh path in execute_onboarding routes through submit_remote_attestation, which calls send_and_wait. The mock left that as unimplemented!(), so tests::onboarding::test_onboarding panicked when the refresh fired during back-migration onboarding. The mock now forwards the transaction through its existing channel and returns TransactionStatus::Executed, matching real production behaviour for the test's purposes.

When `wait_for_node_indexer_height_above` times out we have no visibility into the underlying cause from CI logs — the mpc-node binary writes to `home_dir/stderr.log` but nextest doesn't capture per-node logs. PR #3365 already added a "node may have exited" hint in the inner error; the outer context now dumps the tail of the relevant node's stderr.log so any panic/abort reason shows up directly in the test panic message. This is a diagnostic aid for chasing the residual #3366 crash window. Trade-off: when the wait genuinely just times out (rare under #3365's fix), the panic message gets ~16KB longer. Worth it. Adds `MpcNodeState::home_dir()` accessor so the test helper can locate the file without re-fetching it from each `match` arm.

The check-use-in-fn lint disallows `use` inside fn bodies; lift the imports added in d970722 out of `read_stderr_tail`.

…on (#2121)" This reverts commit 1bcbb43.

Self-contained write-up suitable for filing as a nearcore GitHub issue once we decide to file it. Covers: - Both panic modes with full stack traces (streamer/mod.rs:207 and client_actor.rs:217) - Affected versions (2.12.0-rc.1, both aab31b0 and fadb5c1) - Step-by-step reproduction pointing at this PR's branch + the specific triggering commit (1bcbb43, the node-side fix in PR #2121) - Revert experiment (PR #3373) isolating the trigger to one commit - What we tried and ruled out: SIGKILL, SIGTERM, drain, settle-time - Explicit note about SIGTERM — mpc-node has no handler installed, so the experiment was effectively identical to SIGKILL; we cannot claim to know whether a proper graceful path would prevent the panic, but SIGKILL is a legitimate production scenario and shouldn't corrupt state unrecoverably regardless - Hypothesis about cross-block receipt tracking + RocksDB flush boundaries - Links to CI logs showing both panic modes inline Companion to `2121-back-migration-e2e-flake.md` (full experiment matrix) — that one is the lab notebook, this one is the bug report.

barakeinav1 added 8 commits May 27, 2026 09:00

ci: drop useless .into() flagged by clippy 1.93

dd2d62a

`bogus_ed25519_public_key()` already returns `Ed25519PublicKey`; the trailing `.into()` is a no-op the older local clippy did not catch.

ci: move use std::io::* to module scope in common.rs

cc3eeae

The check-use-in-fn lint disallows `use` inside fn bodies; lift the imports added in d970722 out of `read_stderr_tail`.

Revert "fix(node): refresh attestation before concluding back-migrati…

2def2c1

…on (#2121)" This reverts commit 1bcbb43.

barakeinav1 changed the title ~~DO NOT MERGE: experiment — revert #2121 node fix to isolate CI flake cause~~ chore: experiment — revert #2121 node fix to isolate CI flake cause (do not merge) May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: experiment — revert #2121 node fix to isolate CI flake cause (do not merge)#3373

chore: experiment — revert #2121 node fix to isolate CI flake cause (do not merge)#3373
barakeinav1 wants to merge 8 commits into
mainfrom
barak/2121-flake-experiment-revert-node-fix

barakeinav1 commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

barakeinav1 commented May 27, 2026

Hypothesis being tested

Possible outcomes

What gets closed after the experiment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant