Background
mpc-node does not install a SIGTERM handler today. When an operator stops the process via docker stop, kubectl delete, systemctl stop, or dstack's CVM stop command, the orchestrator sends SIGTERM first and then SIGKILL after a grace period (10s for Docker default, 30s for Kubernetes, 90s for systemd). Because we have no handler installed, SIGTERM has the same effect as SIGKILL: the OS terminates the process immediately, the embedded near-indexer thread is killed mid-write, and the next start can land on inconsistent RocksDB state.
This issue was surfaced while investigating docs/investigation/2121-back-migration-e2e-flake.md, where a CI test SIGKILLs mpc-node mid-flight and the next start panics ~65–80% of the time inside near-indexer (see docs/investigation/nearcore-indexer-sigkill-restart-panic.md). Production stops via dstack/Docker/Kubernetes/systemd take the same code path, so any production stop today carries the same restart-corruption risk as the test scenario.
User Story
As an operator stopping an MPC node via my orchestrator (dstack CVM stop / Docker / Kubernetes / systemd), I want the node to receive SIGTERM, finish in-flight commits, and exit cleanly within the grace period — so that the next start finds RocksDB in a consistent state and doesn't trip an indexer restart panic.
Acceptance Criteria
- mpc-node installs a SIGTERM handler that routes the signal into the existing internal shutdown channel (shutdown_signal_sender), so the same tokio::select! arm that handles TEE image-hash shutdowns also handles SIGTERM.
- After select! exits, near_async::shutdown_all_actors() is called so nearcore's actor system can commit any in-flight RocksDB batches before the process exits.
- tracing::warn!("SIGTERM received, initiating graceful shutdown") is emitted when the signal arrives, so operators can confirm the path was taken.
- mpc-node exits gracefully (typically within 100 ms) after SIGTERM, instead of the SIGKILL fallback firing as in the pre-fix state.
Resources & Additional Notes
- The investigation that surfaced this issue and the test campaign data live in docs/investigation/2121-back-migration-e2e-flake.md. With a working handler, the e2e test passed 1/5 vs 0/5 without it on the same commit. This confirms the handler is a real production improvement, but it does not fully close the upstream nearcore restart panic, which fires non-deterministically regardless of shutdown cleanliness.
- The upstream nearcore bug is documented separately in docs/investigation/nearcore-indexer-sigkill-restart-panic.md. That issue needs to be fixed in nearcore; this issue is the orthogonal mpc-node-side fix that should land regardless.
- We considered also calling near_store::db::RocksDB::block_until_all_instances_are_dropped() (which neard's standalone binary does after shutdown_all_actors), but it hangs indefinitely in our embedding because our indexer thread's block_on never returns: the spawned monitor tasks hold Arc<IndexerState> → Arc<RocksDB> references that nothing currently cancels on shutdown. A proper fix for that hang would wire a CancellationToken through the indexer thread; that is out of scope for this issue.
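The routing described in the acceptance criteria can be modelled with the standard library alone. The sketch below is not the mpc-node implementation: ShutdownReason, forward_sigterm, run_until_shutdown and the flush closure are hypothetical names introduced for illustration, a std::sync::mpsc channel stands in for the async channel behind shutdown_signal_sender, and the signal is simulated by a thread. In the real node the sender would presumably be fed by an async listener such as tokio::signal::unix::signal(SignalKind::terminate()), and the flush step would be near_async::shutdown_all_actors().

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;
use std::time::Duration;

/// Why the node is shutting down (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ShutdownReason {
    TeeImageHash,
    Sigterm,
}

/// Stand-in for the signal listener: logs the event and forwards it into the
/// shared shutdown channel. Here the caller triggers it explicitly.
fn forward_sigterm(shutdown_signal_sender: &Sender<ShutdownReason>) {
    eprintln!("WARN SIGTERM received, initiating graceful shutdown");
    // A send error only means the receiver is gone, i.e. shutdown already ran.
    let _ = shutdown_signal_sender.send(ShutdownReason::Sigterm);
}

/// Blocks until any shutdown source fires, then runs the cleanup step once.
/// `flush` is an assumption standing in for near_async::shutdown_all_actors().
fn run_until_shutdown<F: FnOnce()>(rx: Receiver<ShutdownReason>, flush: F) -> ShutdownReason {
    // TEE image-hash shutdowns and SIGTERM arrive on the same channel,
    // so a single wait point handles both.
    let reason = rx.recv().expect("all shutdown senders dropped");
    flush();
    reason
}

fn main() {
    let (shutdown_signal_sender, rx) = mpsc::channel();
    let sigterm_sender = shutdown_signal_sender.clone();

    // Simulate the orchestrator delivering SIGTERM shortly after start.
    let signal_thread = thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        forward_sigterm(&sigterm_sender);
    });

    let reason = run_until_shutdown(rx, || println!("flushing in-flight work"));
    signal_thread.join().unwrap();
    println!("exited cleanly: {:?}", reason);
}
```

Keeping a single receive point means the cleanup step runs on every shutdown path, whichever source triggered it.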