Skip to content

fix(node): refresh attestation before concluding back-migration (#2121)#3362

Draft
barakeinav1 wants to merge 35 commits into
mainfrom
barak/2121-contract-stale-attestation-test
Draft

fix(node): refresh attestation before concluding back-migration (#2121)#3362
barakeinav1 wants to merge 35 commits into
mainfrom
barak/2121-contract-stale-attestation-test

Conversation

@barakeinav1
Copy link
Copy Markdown
Contributor

@barakeinav1 barakeinav1 commented May 26, 2026

Closes #2121.

Summary

Back-migration (A → B → A) can leave the destination's stored on-chain attestation past expiry by the contract's current_time_seconds at conclude time. When that happens, reverify_participants rejects conclude_node_migration with InvalidTeeRemoteAttestation, and retry_conclude_onboarding loops forever on the same stale state — the migration never completes.

This PR fixes that by having the destination submit a fresh attestation before entering retry_conclude_onboarding, and adds contract-sandbox tests that pin both the failure mode and the recovery path.

The fix

crates/node/src/migration_service/onboarding.rs::execute_onboarding now calls a new helper submit_attestation_before_concluding_migration between keyshare import and retry_conclude_onboarding. The helper generates a fresh quote via TeeAuthority and submits it via submit_remote_attestation, which blocks until TransactionStatus::Executed — so by the time retry_conclude_onboarding runs, the contract holds a fresh attestation under the destination's TLS key.

Helper failure is logged but non-fatal: in the non-back-migration path the existing on-chain attestation is typically still valid, so falling through to retry_conclude_onboarding is safe.

Plumbing: TeeAuthority and account_public_key are threaded run.rs → spawn_recovery_server_and_run_onboarding → onboard → execute_onboarding.

Contract-sandbox tests (crates/contract/tests/sandbox/tee_back_migration.rs)

Two paired tests sharing a setup_stale_back_migration helper:

  1. conclude_node_migration__rejects_when_destination_attestation_is_stale — reproduces the bug. Submits an attestation with expiry = now + 5s, fast-forwards 100 blocks, calls conclude_node_migration directly; asserts InvalidTeeRemoteAttestation.
  2. conclude_node_migration__succeeds_when_destination_submits_fresh_attestation_before_conclude — spec-by-test for the fix. Same setup, but the destination submits a fresh attestation for the same TLS key before calling conclude; asserts success. This is exactly the recovery path the new node-side helper drives.

Both pass in ~12s.

What's covered vs out of scope

Layer Covered
Contract rejects when destination attestation is past expiry by current_time_seconds ✅ contract test 1
Contract accepts when a fresh attestation is submitted just before conclude ✅ contract test 2
Node submits a fresh attestation before retry_conclude_onboarding ✅ unit test submit_attestation_before_concluding_migration__should_submit_one_fresh_attestation
Mocked Rust E2E A → B → A regression ✅ already merged in #3346
Real-TDX localnet smoke test for A → B → A ✅ already merged in #3338
7-day PCCS-collateral TTL real-TDX repro ❌ operationally impractical; mock substitutes

Test plan

  • cargo test -p mpc-node --lib submit_attestation_before_concluding_migration — passes.
  • cargo test -p mpc-contract --test test --profile test-release conclude_node_migration — both contract tests pass.
  • cargo check -p mpc-node --lib --tests — clean.
  • cargo fmt --check — clean.
  • CI green.

Related

Follow-up

@barakeinav1 barakeinav1 changed the title test(contract): contract-side reproduction of #2121 fix(node): refresh attestation before concluding back-migration (#2121) May 26, 2026
@barakeinav1 barakeinav1 force-pushed the barak/2121-contract-stale-attestation-test branch from f947ff9 to 3a63c91 Compare May 27, 2026 06:51
Adds `crates/contract/tests/sandbox/tee_back_migration.rs` with a single
test that drives `conclude_node_migration` against a stale destination
attestation and asserts the contract rejects with
`InvalidTeeRemoteAttestation`.

Setup mirrors the existing `reshare__should_evict_expired_attestations_via_post_reshare_sweep`
pattern: submit a `MockAttestation::WithConstraints { expiry_timestamp_seconds: now+5 }`,
then `worker.fast_forward()` past the expiry before calling conclude.
This proves the on-chain rejection path #2121 describes is reachable
without needing real TDX hardware or a 7-day wall-clock wait.

Scope: contract-side only. The node-side question — does
`periodic_attestation_submission` fire fast enough on the destination's
restart to close the race window before `retry_conclude_onboarding`
drains its budget — needs a running mpc-node binary and is tracked
separately.

Test passes in ~35s on `cargo test -p mpc-contract --test test --profile test-release`.
Extends `tee_back_migration.rs` with a companion test demonstrating the
fix: if the destination submits a fresh attestation for the same TLS
key before calling `conclude_node_migration`, the contract accepts the
conclude. This is what a fixed node should do in `execute_onboarding`
before entering `retry_conclude_onboarding`.

Shared setup factored into `setup_stale_back_migration` so both tests
diverge only at the final step (one calls conclude directly and expects
rejection; the other submits fresh + conclude, expects success).

Both tests pass in 11.77s total.
Back-migration (A -> B -> A) can leave the destination's stored on-chain
attestation past expiry by the contract's `current_time_seconds` at
conclude time. `reverify_participants` then rejects
`conclude_node_migration` with `InvalidTeeRemoteAttestation`, and
`retry_conclude_onboarding` loops forever on the same stale state.

Adds `submit_attestation_before_concluding_migration` in
`tee::remote_attestation`. `execute_onboarding` calls it between keyshare
import and `retry_conclude_onboarding`; the helper generates a fresh
quote via `TeeAuthority` and submits it via `submit_remote_attestation`,
which waits for `TransactionStatus::Executed`. By the time conclude is
attempted the contract holds a fresh attestation under the destination's
TLS key. Helper failures are logged but non-fatal so the non-back-
migration path is unaffected.

Drives the recovery path pinned by the companion contract-sandbox test
`conclude_node_migration__succeeds_when_destination_submits_fresh_attestation_before_conclude`.

Plumbing: `TeeAuthority` and `account_public_key` are threaded
`run.rs` -> `spawn_recovery_server_and_run_onboarding` -> `onboard`
-> `execute_onboarding`.
`bogus_ed25519_public_key()` already returns `Ed25519PublicKey`; the
trailing `.into()` is a no-op the older local clippy did not catch.
The new attestation-refresh path in execute_onboarding routes through
submit_remote_attestation, which calls send_and_wait. The mock left that
as unimplemented!(), so tests::onboarding::test_onboarding panicked when
the refresh fired during back-migration onboarding.

The mock now forwards the transaction through its existing channel and
returns TransactionStatus::Executed, matching real production behaviour
for the test's purposes.
When `wait_for_node_indexer_height_above` times out we have no visibility
into the underlying cause from CI logs — the mpc-node binary writes to
`home_dir/stderr.log` but nextest doesn't capture per-node logs. PR #3365
already added a "node may have exited" hint in the inner error; the outer
context now dumps the tail of the relevant node's stderr.log so any
panic/abort reason shows up directly in the test panic message.

This is a diagnostic aid for chasing the residual #3366 crash window.
Trade-off: when the wait genuinely just times out (rare under #3365's fix),
the panic message gets ~16KB longer. Worth it.

Adds `MpcNodeState::home_dir()` accessor so the test helper can locate
the file without re-fetching it from each `match` arm.
The check-use-in-fn lint disallows `use` inside fn bodies; lift the
imports added in d970722 out of `read_stderr_tail`.
Captures the full investigation chain that established the back-migration
e2e flake on this PR is an upstream nearcore bug, not a defect in the node
fix. Includes panic excerpts (both modes), the 5/6-vs-0/5 revert experiment
result, and a precise repro shape suitable for an upstream issue.
A trimmed-down sibling of `should_handle_back_migration_a_to_b_to_a`
that triggers the same upstream `near_indexer::streamer` / `near_client::client_actor`
panic with the minimum operational sequence: one forward migration round
(which produces the triggering `SubmitParticipantInfo` -> `ConcludeNodeMigration`
adjacent-block pattern), SIGKILL of the source node, restart, then the
indexer-progress assertion from PR #3365's helper.

Drops the back-migration round, the BackupService recreation, and the
sign+ckd verification at the end — none are needed once the panic
condition is captured. ~70 lines vs ~140 for the full back-migration test.

Currently expected to fail with the documented nearcore panic. Once
nearcore's SIGKILL recovery is fixed, this test should pass and serve
as the permanent regression guard.

See: docs/investigation/2121-back-migration-e2e-flake.md
…tale-attestation-test

# Conflicts:
#	crates/node/src/migration_service.rs
#	crates/node/src/migration_service/onboarding.rs
… activity

Refactors `forward_migration_kill_restart__should_recover_indexer` into
a shared driver and adds 3 sibling variants that differ only in what
sign/CKD activity runs between forward conclude and the SIGKILL:

- variant 0 (existing, renamed comment): no activity      [baseline]
- variant 1: one sign request only
- variant 2: one CKD request only
- variant 3: both — same activity profile as the back-migration test

These deliberately preserve variant 0 as evidence that the forward+kill
shape alone does not surface the upstream nearcore panic. Comparing the
four pass rates will pin down which receipt graph (sign_respond /
respond_ckd / both) is necessary to make the bug reachable, beyond the
two-tx pattern from #3362 that the revert experiment already proved is
necessary.

See: docs/investigation/2121-back-migration-e2e-flake.md
Extends `run_forward_kill_restart_repro` to take separate (pre_sign,
pre_ckd, post_sign, post_ckd) flags and adds three new tests covering
the pre-forward axis of the activity matrix:

- variant 4: pre-forward sign only        [A0-signed sign_respond, no other]
- variant 5: pre-forward ckd only         [A0-signed respond_ckd, no other]
- variant 6: pre-forward sign + ckd       [both, no post-forward activity]

The first 5-run campaign on variants 0-3 showed sign+ckd together
post-forward fires the panic only ~20% (vs ~71% for back-migration).
The remaining structural difference is the back-migration test's
pre-forward sign+ckd block, which puts A0-signed respond receipts
into chain history before A0 is demoted. Variants 4-6 isolate that
ingredient.

See: docs/investigation/2121-back-migration-e2e-flake.md
- Headline statistics now include the 4-variant post-forward matrix
  (campaign 1, 5 runs each): variants 0–2 fire 0%, variant 3 fires ~20%,
  back-migration fires ~71% combined.
- Documents the 8-cell experiment matrix (variants 0–6 + back-migration)
  with the pre/post sign+CKD breakdown.
- Explains the open question for campaign 2 (pre-forward variants 4–6)
  and how its result will narrow the trigger to one of the three
  remaining hypotheses (write-rate / receipt-graph / time-alignment).

No code changes — this is a documentation-only update so the experiment
state can be shared with peers while campaign 2 is still in flight.
Builds on the 7-variant matrix (variants 0-6) to fill out the full 16-cell
(pre-sign, pre-ckd, post-sign, post-ckd) grid plus 2 SIGTERM counterparts:

- variants 7-14: the 8 missing pre+post combinations
- variant 15: full pre + full post (matches back-migration profile up to kill)
- forward variant 15 + SIGTERM: graceful-shutdown equivalent of variant 15
- back-migration + SIGTERM: graceful-shutdown equivalent of the existing
  back-migration test

New infrastructure:
- `NodeStopSignal` enum in e2e-tests::mpc_node with Sigkill / Sigterm variants
- `ProcessGuard::terminate_with_grace` sends SIGTERM via libc::kill, polls
  for graceful exit, falls back to SIGKILL on grace-period timeout
- `MpcCluster::stop_nodes_with` mirrors `kill_nodes` but takes a signal

The two SIGTERM variants provide controlled comparisons: if SIGTERM passes
reliably while SIGKILL flakes, the bug is squarely on nearcore's SIGKILL
recovery path. The full 16-cell matrix lets us isolate exactly which
activity profile reaches the upstream panic.

See: docs/investigation/2121-back-migration-e2e-flake.md
Replaces the partial 4-variant tables with the consolidated results across
all four campaigns. Key updates:

- Variant 15 (full pre+post sign/CKD) is the smoking-gun repro at ~70% —
  same activity profile as the back-migration test (which fires ~65%),
  same panic. The back-migration round itself adds nothing.
- All 4 activity types must be present simultaneously. Drop any one
  (variants 9, 12, 13, 14) and the rate falls to ~10–30% or zero.
- 2-of-4 or fewer activity types almost never fire.
- SIGTERM at 30s grace does NOT trivially fix the bug: 2/2 observations
  of the SIGTERM-15 and SIGTERM-back-migration variants failed with the
  same `streamer/mod.rs:207` panic.
- Receipt-graph hypothesis is the best fit. The SIGTERM-still-fails
  result is a strong hint that the in-memory streamer state (not just
  on-disk RocksDB) is the broken thing.

Follow-ups updated to recommend keeping variant 15 + back-migration as
the canonical repros and probing the SIGTERM result more carefully.
Prunes the 16-variant focused-repro matrix down to the two tests that
matter — variant 15 (`forward_migration_kill_restart_with_pre_both_and_post_both`)
and `should_handle_back_migration_a_to_b_to_a` — and changes both to
**drain** the node before SIGKILLing it.

Mechanism: `drain_node_and_wait_idle` in `common.rs` writes
`listen_blocks.flag = false` (which makes the embedded indexer park at
its cancellation point between block processing) and then polls the
indexer's block-height metric until it stops advancing for a few seconds.
With the indexer provably idle, SIGKILL catches no half-built streamer
state and `neard` can resume cleanly on restart. The activity profile is
unchanged — the drain step is added between post-forward activity and the
kill.

Models production reality: operators do rolling restarts by stopping
new traffic, draining in-flight work, then stopping the process. This
test now does the same.

The deleted variants (0–14 + 2 SIGTERM) served their diagnostic purpose
in identifying the trigger; the matrix data and panic excerpts are
preserved in `docs/investigation/2121-back-migration-e2e-flake.md`.
…pstream

After SIGKILL (~70% fail) and SIGTERM-30s (2/2 fail), the third teardown
strategy — pause `listen_blocks.flag` and wait for our consumer-side
indexer to reach idle, then SIGKILL — also fails 4/4 (5th run in flight).
Failures take 106–127 s instead of ~102 s, confirming the drain ran;
it just doesn't address where the bug lives.

Why drain-via-listen_blocks doesn't help: the flag pauses our consumer
in `crates/node/src/indexer/handler.rs::listen_blocks`. The panic is on
the producer side — nearcore's `near_indexer::streamer::start::{{closure}}`.
We have no test-side mechanism that can drain that task short of waiting
for neard's own graceful shutdown, which SIGTERM-30s also didn't allow
to finish.

Three failed strategies together rule out test-side fixes:
  plain SIGKILL          → ~70% fail
  SIGTERM, 30 s grace    → 2 / 2 fail
  drain consumer + SIGKILL → 4 / 4 fail

This is now solid evidence that the bug must be fixed in nearcore
itself. We can't paper over it from the e2e test without losing the
kill+restart semantics the test exists to model.

Follow-ups updated: opening the upstream issue is now the unblocking
step; decide whether to `#[ignore]` the two remaining repros, use
`reset_and_start_nodes` (papers over but stable), or leave flaky.
…s no SIGTERM handler

Two updates after the 5-run drain-then-kill campaign completed:

1. Drain-then-kill final tally is 5/5 fail (was 4/5 mid-flight) for both
   variant 15 and the back-migration test. Locks the verdict in.

2. Discovered that mpc-node has **no SIGTERM handler at all** — `grep`
   for `tokio::signal` / `SignalKind` / `ctrl_c` / HTTP shutdown turns up
   nothing across the node crate. The internal `shutdown_signal_sender`
   in run.rs:183 exists and is wired into the main tokio::select!, but
   only `monitor_allowed_image_hashes` fires it today. That means all
   our SIGTERM experiments were effectively identical to SIGKILL (default
   OS behavior of SIGTERM = terminate-immediately). Updated the headline,
   the SIGTERM row in the failed-strategies table, and added a new
   follow-up (#5) calling out the graceful-shutdown gap as a real
   production deficiency worth its own issue.
…e hypothesis

Adds `tokio::time::sleep(Duration::from_secs(60))` between the drain step
and SIGKILL at all three kill sites (variant 15, plus A0+B0 in the back-
migration test). Combined with the existing drain, this gives nearcore's
producer-side streamer ~60+ s of inactivity before the kill, testing
whether the upstream panic correlates with how recently neard wrote
state — option A from
`docs/investigation/2121-back-migration-e2e-flake.md`.

Each test now takes ~60-90 s longer; nextest will mark them SLOW but
they should still run. If the panic still fires after the long settle,
that closes the door on "kill-too-close-to-last-activity" as the
mechanism.
Self-contained write-up suitable for filing as a nearcore GitHub issue
once we decide to file it. Covers:

- Both panic modes with full stack traces (streamer/mod.rs:207 and
  client_actor.rs:217)
- Affected versions (2.12.0-rc.1, both aab31b0 and fadb5c1)
- Step-by-step reproduction pointing at this PR's branch + the specific
  triggering commit (1bcbb43, the node-side fix in PR #2121)
- Revert experiment (PR #3373) isolating the trigger to one commit
- What we tried and ruled out: SIGKILL, SIGTERM, drain, settle-time
- Explicit note about SIGTERM — mpc-node has no handler installed, so
  the experiment was effectively identical to SIGKILL; we cannot claim
  to know whether a proper graceful path would prevent the panic, but
  SIGKILL is a legitimate production scenario and shouldn't corrupt
  state unrecoverably regardless
- Hypothesis about cross-block receipt tracking + RocksDB flush
  boundaries
- Links to CI logs showing both panic modes inline

Companion to `2121-back-migration-e2e-flake.md` (full experiment matrix)
— that one is the lab notebook, this one is the bug report.
Earlier wording overstated the reproduction rate. Replaced three uses of
"deterministic" with the actual measured rate (~65–70% per run) so the
upstream readers don't expect a 100% repro.
Adds a real graceful-shutdown path that mirrors what `neard`'s standalone
binary does in its own SIGTERM handler:

1. Catch SIGTERM via `tokio::signal::unix::signal(SignalKind::terminate())`
   and route it into the existing internal `shutdown_signal_sender`
   channel.
2. After the main `tokio::select!` exits via that channel, call
   `near_async::shutdown_all_actors()` to stop the embedded nearcore
   actor system (cancels tokio tasks, signals multithread actors).
3. Then call `near_store::db::RocksDB::block_until_all_instances_are_dropped()`
   so the process doesn't exit before RocksDB has finished closing.

Without this, SIGTERM hit mpc-node with no handler installed and the OS
terminated the process immediately — functionally identical to SIGKILL —
leaving RocksDB in a state that could break recovery on the next start.
The investigation that surfaced this is in
`docs/investigation/nearcore-indexer-sigkill-restart-panic.md`.

Production impact: every dstack/Docker/Kubernetes stop sends SIGTERM
first, then SIGKILLs if the process doesn't exit within the grace
window. With this handler, mpc-node now actually uses that grace window
to flush state.

Test impact: switches the two remaining e2e regression tests
(`should_handle_back_migration_a_to_b_to_a` and
`forward_migration_kill_restart_with_pre_both_and_post_both`) from
SIGKILL to SIGTERM with 60 s grace. If graceful shutdown is enough to
dodge the upstream nearcore restart panic, both tests will now pass.

Removes the now-unused `drain_node_and_wait_idle` helper and its
listen_blocks.flag dance (replaced by real graceful shutdown).
…tale-attestation-test

# Conflicts:
#	Cargo.lock
#	crates/node/Cargo.toml
`cargo fmt --check` caught the extra blank line in e2e-tests/tests/common.rs
between `current_node_indexer_height` and the next helper. This was a
leftover from removing the drain helper in the SIGTERM-handler commit.
`cargo make check-all-fast` runs `cargo doc` with
`-D rustdoc::broken-intra-doc-links`. The intra-doc link to `kill_nodes`
on `MpcCluster::stop_nodes_with`'s comment didn't resolve in scope.
Drop the brackets so it's a normal code-formatted reference instead.
…_resharing regression

`cancellation_of_resharing::test_cancellation_of_resharing` is now
reproducibly failing on this branch (24s, WASM trap on the
`vote_cancel_resharing` call) while it passes on main. Our PR adds no
contract code changes vs main, so the difference must come from test
scheduling — nextest orders tests alphabetically and we added variant 15
(`forward_migration_kill_restart_with_pre_both_and_post_both`) which
puts it ahead of `cancellation_of_resharing` in the order. With the
runner's max-threads of 4 and variant 15's heavy migration/sandbox
workload, the concurrent setup pressure during the early test window
may be racing the resharing state machine.

Marking variant 15 `#[ignore]` for one CI run to test the hypothesis.
If `cancellation_of_resharing` passes again with variant 15 disabled,
the regression is purely test-ordering interference; we'll then need
to either keep variant 15 ignored, restructure it to not race, or
adjust the test parallelism config.
When resolving the conflict during the merge from origin/main, I used
`cargo generate-lockfile` which resolved newer versions of several
near-sdk dependencies than main has — added new entries for
near-crypto-hash, near-global-contracts, near-sdk-core, near-sdk-env.

The contract's WASM gets built against this lock file. The newer
near-sdk versions produced different runtime behavior, which broke
`cancellation_of_resharing::test_cancellation_of_resharing` (it
WASM-trapped on `vote_cancel_resharing` instead of returning the
expected `NotParticipant` error). Three CI runs reproduced this.

Replaced with `git checkout origin/main -- Cargo.lock` so we inherit
main's resolved versions verbatim, then `cargo build -p mpc-node`
populated just the entries needed for my new `near-store` dep.
Resulting diff vs main is 20 lines, all in the `near-store`
neighborhood.
After the SIGTERM-handler fix removed the upstream nearcore restart
panic, the back-migration test reached a different failure point:
target's debug endpoint still showed destination_node_info.tls_public_key=None
when the 30s INDEXER_SYNC_TIMEOUT elapsed. The contract-view check
two retries earlier already passed, so the migration TX is on chain —
target just hadn't propagated it to its local debug view yet.

Test runs last (#23/23) with max concurrency = 4, so machines are loaded.
Doubling the timeout to 60s tests whether this is a tight-timing CI
artifact or a real post-restart coordinator-state bug.
Three changes to make each back-migration test run definitively
classifiable instead of 'failed somewhere':

1. Rotate stderr.log → stderr.log.previous on restart. File::create
   was truncating the pre-restart logs, so SIGTERM-handler messages
   (SIGTERM received, RocksDB shutdown complete, etc.) were gone
   before the diagnostic ever read the file.

2. wait_for_node_indexer_height_above now dumps both tails on
   failure. With this, the panic message will show pre-restart
   shutdown activity AND post-restart panic, side by side.

3. terminate_with_grace eprintln!s either 'exited gracefully Xms
   after SIGTERM' or 'SIGTERM grace period elapsed; falling back
   to SIGKILL'. nextest reliably captures eprintln in failure
   output (tracing::warn may not), so we can tell when grace
   timed out vs when the handler completed cleanly.

Together these let us classify each failing run into one of four
categories rather than guessing.
5-run campaign showed our SIGTERM handler hangs past the 60s grace
period in 5/5 runs. Diagnostics pinpointed the hang: our embedded
indexer runs in a std::thread::spawn'd closure whose block_on never
returns because the spawned monitor_* tasks hold Arc<IndexerState>
references that nothing cancels on shutdown. So
block_until_all_instances_are_dropped() loops forever waiting for
the RocksDB ref count to hit zero. The test harness then SIGKILLs us
and we land in the exact uncommitted-RocksDB state the handler was
supposed to prevent — functionally identical to no handler at all.

We still call shutdown_all_actors() so nearcore tasks can commit any
in-flight RocksDB batches, but we no longer wait for refs to drop.
RocksDB's WAL still guarantees committed data survives kill; this
just closes a smaller flush window. The previously-introduced
near-store dep is now unused, removed.

A proper fix would wire a CancellationToken through the indexer
thread (option B in the recent investigation discussion). This
change is option A — minimal, lets the campaign actually exercise
the handler instead of always tripping SIGKILL fallback.
…tdown question

The earlier docs had to hedge: SIGTERM experiments couldn't distinguish
'graceful shutdown doesn't help' from 'we never had a SIGTERM handler.'
We've now closed that gap by landing a real SIGTERM handler (#3409 /
#3410) and re-running the back-migration campaign.

- 2121-back-migration-e2e-flake.md: new section 'Real SIGTERM handler in
  mpc-node — also does not fix it' covering iteration 1 (handler hangs
  on block_until_all_instances_are_dropped, 5/5 SIGKILL fallback) and
  iteration 2 (drop the block_until call, 1/5 pass with 100 ms graceful
  shutdowns in 5/5). Follow-up #5 updated — in flight via #3409 / #3410.
- nearcore-indexer-sigkill-restart-panic.md: new 'Strongest evidence:
  graceful shutdown doesn't help' subsection inside 'What we ruled out';
  table updated with the 5-run SIGTERM-handler row (4/5 fail at 100 ms);
  TL;DR + Workarounds + References updated; old SIGTERM-disclaimer
  removed since the question is now answered.

Together the docs now say: same panic, same rate, regardless of whether
shutdown was SIGKILL or a verified 100 ms graceful actor-system stop.
The fix has to be in nearcore.
…n gracefully

Earlier wording claimed 'graceful shutdown does not prevent the panic.'
A nearcore reviewer could rightly push back: 'you didn't actually let
the indexer shut down — its tokio runtime keeps running until the
process exits.' Both docs now state this explicitly:

- The 100 ms 'graceful' covers mpc-node's main runtime + a stop signal
  into nearcore's ACTOR_SYSTEMS registry. It does NOT cover the
  embedded indexer thread, whose ~6 spawned monitor tasks hold
  Arc<IndexerState> -> Arc<RocksDB> refs that nothing cancels.
- The std::thread::spawn'd indexer closure is terminated by OS-level
  process exit at ~100 ms, mid-flight, like SIGKILL.
- We still believe the bug is upstream (3 reasons: panic reads
  pre-shutdown state, WAL already guarantees committed-data
  consistency, recovery shouldn't unwrap regardless), but we're
  honest that we haven't empirically tested 'fully drain the indexer
  thread first' (option B).

TL;DR adjusted to say 'reasonably available to the embedder' and to
foreground the recovery-path argument rather than the shutdown-timing
argument. Workarounds and table-row wording updated for consistency.
… RocksDB drop

Iteration 2 of the SIGTERM handler left the indexer thread running:
its tokio runtime hosted spawned monitor tasks holding Arc<RocksDB>
refs that nothing cancelled, so we couldn't call
block_until_all_instances_are_dropped without it hanging forever. We
worked around that by skipping the block_until call entirely — clean
for mpc-node main, indexer thread killed mid-flight on process exit.

This change actually drains the indexer thread:

- spawn_real_indexer takes a CancellationToken parameter.
- The terminal listen_blocks .await is wrapped in tokio::select! racing
  the token's cancelled() future.
- When the token cancels: listen_blocks branch aborts, block_on
  returns, the std::thread::spawn closure ends, the indexer's tokio
  runtime drops, and every tokio::spawn'd monitor task (each holding
  Arc<IndexerState> -> Arc<RocksDB>) is aborted with the runtime.
  Their Drop impls release the Arc refs.
- In run.rs the shutdown sequence becomes: shutdown_all_actors() (give
  nearcore actors a chance to commit) -> indexer_shutdown_token.cancel()
  (release indexer thread refs) -> block_until_all_instances_are_dropped()
  (now safe, will return promptly).

This is the minimum-viable form of option B: rather than threading the
token through every monitor function, rely on runtime-drop to abort all
spawned tasks. Substantially smaller diff and same behavior.

Whether this changes the upstream nearcore restart-panic rate is the
next question — campaign of 5 runs on this commit will answer it.
The kill+restart diagnostic only rotated stderr.log and dumped its
tails. That captured panics but missed every tracing::info!/warn!/error!
emitted by mpc-node, because tracing-subscriber writes to STDOUT by
default. To prove option B is actually executing its shutdown path
(SIGTERM received -> Stopping nearcore actor system -> Cancelling
indexer shutdown token -> Indexer thread received shutdown signal ->
Waiting for RocksDB instances -> RocksDB shutdown complete), the
diagnostic now also rotates stdout.log and dumps its pre- and
post-restart tails in the panic message.

Symmetric to the existing stderr.log handling.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: enable migration of key-shares into a populated keyshare storage

1 participant