Skip to content

chore: experiment — revert #2121 node fix to isolate CI flake cause (do not merge)#3373

Draft
barakeinav1 wants to merge 8 commits into
mainfrom
barak/2121-flake-experiment-revert-node-fix
Draft

chore: experiment — revert #2121 node fix to isolate CI flake cause (do not merge)#3373
barakeinav1 wants to merge 8 commits into
mainfrom
barak/2121-flake-experiment-revert-node-fix

Conversation

@barakeinav1
Copy link
Copy Markdown
Contributor

Not for merge. This is a CI experiment branch for PR #3362's flake investigation.

Same branch as #3362 (rebased on main, includes #3365's test fix, contract sandbox tests, stderr-tail diagnostic) but with one commit reverted: the node-side `submit_attestation_before_concluding_migration` change.

Hypothesis being tested

PR #3365 ran 15/15 green pre-merge. PR #3362 on top of merged main fails 5/6. The bug surfaces deterministically only with #3362's node-side fix on the branch. If the failure is caused by the specific TX pattern #3362 introduces (`SubmitParticipantInfo` immediately before `ConcludeNodeMigration`, same signer, adjacent blocks), reverting just that change should make E2E pass.

Possible outcomes

  • E2E passes consistently → the specific TX pattern is the trigger. Use that as repro material for the upstream nearcore issue.
  • E2E still fails → the extra-block-depth hypothesis is also wrong; trigger is something else (e.g. test infrastructure interaction with my stderr-dump on failure, or pure timing).

What gets closed after the experiment

This PR. The findings go back into PR #3362's diagnostic doc.

Adds `crates/contract/tests/sandbox/tee_back_migration.rs` with a single
test that drives `conclude_node_migration` against a stale destination
attestation and asserts the contract rejects with
`InvalidTeeRemoteAttestation`.

Setup mirrors the existing `reshare__should_evict_expired_attestations_via_post_reshare_sweep`
pattern: submit a `MockAttestation::WithConstraints { expiry_timestamp_seconds: now+5 }`,
then `worker.fast_forward()` past the expiry before calling conclude.
This proves the on-chain rejection path #2121 describes is reachable
without needing real TDX hardware or a 7-day wall-clock wait.

Scope: contract-side only. The node-side question — does
`periodic_attestation_submission` fire fast enough on the destination's
restart to close the race window before `retry_conclude_onboarding`
drains its budget — needs a running mpc-node binary and is tracked
separately.

Test passes in ~35s on `cargo test -p mpc-contract --test test --profile test-release`.
Extends `tee_back_migration.rs` with a companion test demonstrating the
fix: if the destination submits a fresh attestation for the same TLS
key before calling `conclude_node_migration`, the contract accepts the
conclude. This is what a fixed node should do in `execute_onboarding`
before entering `retry_conclude_onboarding`.

Shared setup factored into `setup_stale_back_migration` so both tests
diverge only at the final step (one calls conclude directly and expects
rejection; the other submits fresh + conclude, expects success).

Both tests pass in 11.77s total.
Back-migration (A -> B -> A) can leave the destination's stored on-chain
attestation past expiry by the contract's `current_time_seconds` at
conclude time. `reverify_participants` then rejects
`conclude_node_migration` with `InvalidTeeRemoteAttestation`, and
`retry_conclude_onboarding` loops forever on the same stale state.

Adds `submit_attestation_before_concluding_migration` in
`tee::remote_attestation`. `execute_onboarding` calls it between keyshare
import and `retry_conclude_onboarding`; the helper generates a fresh
quote via `TeeAuthority` and submits it via `submit_remote_attestation`,
which waits for `TransactionStatus::Executed`. By the time conclude is
attempted the contract holds a fresh attestation under the destination's
TLS key. Helper failures are logged but non-fatal so the non-back-
migration path is unaffected.

Drives the recovery path pinned by the companion contract-sandbox test
`conclude_node_migration__succeeds_when_destination_submits_fresh_attestation_before_conclude`.

Plumbing: `TeeAuthority` and `account_public_key` are threaded
`run.rs` -> `spawn_recovery_server_and_run_onboarding` -> `onboard`
-> `execute_onboarding`.
`bogus_ed25519_public_key()` already returns `Ed25519PublicKey`; the
trailing `.into()` is a no-op the older local clippy did not catch.
The new attestation-refresh path in execute_onboarding routes through
submit_remote_attestation, which calls send_and_wait. The mock left that
as unimplemented!(), so tests::onboarding::test_onboarding panicked when
the refresh fired during back-migration onboarding.

The mock now forwards the transaction through its existing channel and
returns TransactionStatus::Executed, matching real production behaviour
for the test's purposes.
When `wait_for_node_indexer_height_above` times out we have no visibility
into the underlying cause from CI logs — the mpc-node binary writes to
`home_dir/stderr.log` but nextest doesn't capture per-node logs. PR #3365
already added a "node may have exited" hint in the inner error; the outer
context now dumps the tail of the relevant node's stderr.log so any
panic/abort reason shows up directly in the test panic message.

This is a diagnostic aid for chasing the residual #3366 crash window.
Trade-off: when the wait genuinely just times out (rare under #3365's fix),
the panic message gets ~16KB longer. Worth it.

Adds `MpcNodeState::home_dir()` accessor so the test helper can locate
the file without re-fetching it from each `match` arm.
The check-use-in-fn lint disallows `use` inside fn bodies; lift the
imports added in d970722 out of `read_stderr_tail`.
@barakeinav1 barakeinav1 changed the title DO NOT MERGE: experiment — revert #2121 node fix to isolate CI flake cause chore: experiment — revert #2121 node fix to isolate CI flake cause (do not merge) May 27, 2026
barakeinav1 added a commit that referenced this pull request May 28, 2026
Self-contained write-up suitable for filing as a nearcore GitHub issue
once we decide to file it. Covers:

- Both panic modes with full stack traces (streamer/mod.rs:207 and
  client_actor.rs:217)
- Affected versions (2.12.0-rc.1, both aab31b0 and fadb5c1)
- Step-by-step reproduction pointing at this PR's branch + the specific
  triggering commit (1bcbb43, the node-side fix in PR #2121)
- Revert experiment (PR #3373) isolating the trigger to one commit
- What we tried and ruled out: SIGKILL, SIGTERM, drain, settle-time
- Explicit note about SIGTERM — mpc-node has no handler installed, so
  the experiment was effectively identical to SIGKILL; we cannot claim
  to know whether a proper graceful path would prevent the panic, but
  SIGKILL is a legitimate production scenario and shouldn't corrupt
  state unrecoverably regardless
- Hypothesis about cross-block receipt tracking + RocksDB flush
  boundaries
- Links to CI logs showing both panic modes inline

Companion to `2121-back-migration-e2e-flake.md` (full experiment matrix)
— that one is the lab notebook, this one is the bug report.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant