mpc-node: handle SIGTERM for graceful shutdown on operator stop #3409

### Background

`mpc-node` does not install a SIGTERM handler today. When an operator stops the process via `docker stop`, `kubectl delete`, `systemctl stop`, or dstack's CVM stop command, the orchestrator sends SIGTERM first and then SIGKILL after a grace period (by default 10 s for Docker, 30 s for Kubernetes, and 90 s for systemd). Because no handler is installed, **SIGTERM has the same effect as SIGKILL**: the OS terminates the process immediately, the embedded `near-indexer` thread is killed mid-write, and the next start can land on inconsistent RocksDB state.

This issue was surfaced while investigating [`docs/investigation/2121-back-migration-e2e-flake.md`](https://github.com/near/mpc/blob/main/docs/investigation/2121-back-migration-e2e-flake.md), where a CI test SIGKILLs `mpc-node` mid-flight and the next start panics ~65–80% of the time inside `near-indexer` (see [`docs/investigation/nearcore-indexer-sigkill-restart-panic.md`](https://github.com/near/mpc/blob/main/docs/investigation/nearcore-indexer-sigkill-restart-panic.md)). Production stops via dstack/Docker/Kubernetes/systemd take the same code path, so any production stop today carries the same restart-corruption risk as the test scenario.

### User Story

As an operator stopping an MPC node via my orchestrator (dstack CVM stop / Docker / Kubernetes / systemd), I want the node to receive SIGTERM, finish in-flight commits, and exit cleanly within the grace period — so that the next start finds RocksDB in a consistent state and doesn't trip an indexer restart panic.

### Acceptance Criteria

- [ ] `mpc-node` installs a SIGTERM handler that routes the signal into the existing internal shutdown channel (`shutdown_signal_sender`), so the same `tokio::select!` arm that handles TEE image-hash shutdowns also handles SIGTERM.
- [ ] After the main `select!` exits, `near_async::shutdown_all_actors()` is called so nearcore's actor system can commit any in-flight RocksDB batches before the process exits.
- [ ] `tracing::warn!("SIGTERM received, initiating graceful shutdown")` is emitted when the signal arrives, so operators can confirm the path was taken.
- [ ] Verified in CI: `mpc-node` exits gracefully (typically within 100 ms) after SIGTERM, whereas in the pre-fix state the orchestrator's SIGKILL fallback fires.
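
The control flow these criteria describe can be modelled without any dependencies. The sketch below is only an approximation under stated assumptions: it uses `std::sync::mpsc` and a thread in place of the node's async channel and `tokio::select!` arm, the SIGTERM delivery is simulated rather than read from the OS, and `ShutdownReason`, `spawn_simulated_sigterm`, and `run_until_shutdown` are hypothetical names that do not come from the `mpc-node` codebase. The `flush` callback marks where `near_async::shutdown_all_actors()` would be called.

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::thread;

/// Why the node is shutting down. Both variants are illustrative names.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ShutdownReason {
    /// The TEE image-hash watcher asked the node to stop.
    TeeImageHash,
    /// The operator's orchestrator delivered SIGTERM.
    Sigterm,
}

/// Stand-in for the SIGTERM listener. A real listener would wait on the OS
/// signal; here delivery is simulated so the sketch stays std-only.
pub fn spawn_simulated_sigterm(
    shutdown_signal_sender: Sender<ShutdownReason>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        // A failed send only means the receiver is already gone.
        let _ = shutdown_signal_sender.send(ShutdownReason::Sigterm);
    })
}

/// Blocks until any producer requests shutdown, then runs `flush` once.
/// `flush` is the hook where the actor system would be told to stop so that
/// in-flight batches can be committed before the process exits.
pub fn run_until_shutdown(
    shutdown_signal_receiver: Receiver<ShutdownReason>,
    flush: impl FnOnce(),
) -> Option<ShutdownReason> {
    // `recv` returns Err only if every sender was dropped without sending.
    let reason = shutdown_signal_receiver.recv().ok();
    if reason == Some(ShutdownReason::Sigterm) {
        eprintln!("SIGTERM received, initiating graceful shutdown");
    }
    flush();
    reason
}
```

Because both the image-hash watcher and the signal listener would hold a clone of the same sender, the main loop needs only one shutdown path, and the flush step cannot be skipped for one source and not the other.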

### Resources & Additional Notes

- The investigation that surfaced this issue and the test campaign data live in [`docs/investigation/2121-back-migration-e2e-flake.md`](https://github.com/near/mpc/blob/main/docs/investigation/2121-back-migration-e2e-flake.md). On the same commit, the e2e test passed 1 of 5 runs with a working handler versus 0 of 5 without it, confirming the handler is a real production improvement but **does not fully close the upstream nearcore restart panic**, which fires non-deterministically regardless of shutdown cleanliness.
- The upstream nearcore bug is documented separately in [`docs/investigation/nearcore-indexer-sigkill-restart-panic.md`](https://github.com/near/mpc/blob/main/docs/investigation/nearcore-indexer-sigkill-restart-panic.md). That issue needs to be fixed in nearcore; this issue is the orthogonal mpc-node-side fix that should land regardless.
- We considered also calling `near_store::db::RocksDB::block_until_all_instances_are_dropped()` (which neard's standalone binary does after `shutdown_all_actors`), but it **hangs indefinitely** in our embedding because our indexer thread's `block_on` never returns: the spawned monitor tasks hold `Arc<IndexerState>` → `Arc<RocksDB>` references that nothing currently cancels on shutdown. A proper fix for that hang would wire a `CancellationToken` through the indexer thread; that work is out of scope for this issue.
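
To illustrate the hang described in the last note and why cancellation would resolve it, here is a minimal std-only model. It is a sketch under assumptions, not the proposed implementation: `Db`, `CancelFlag`, `spawn_monitor`, and `shutdown_monitors` are hypothetical names, threads stand in for the spawned monitor tasks, and an `AtomicBool` stands in for a `CancellationToken`. The point is only that a wait for "all instances dropped" can complete once every holder of the `Arc` has been told to stop.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the database handle the monitors keep alive.
pub struct Db;

/// Minimal cooperative cancellation flag, playing the role a
/// cancellation token would play in the async code.
#[derive(Clone, Default)]
pub struct CancelFlag(Arc<AtomicBool>);

impl CancelFlag {
    pub fn cancel(&self) {
        self.0.store(true, Ordering::SeqCst);
    }
    pub fn is_cancelled(&self) -> bool {
        self.0.load(Ordering::SeqCst)
    }
}

/// Spawns a monitor that holds a `Db` reference until it is cancelled.
/// Without the `is_cancelled` check the loop would never end and the
/// reference would never be released.
pub fn spawn_monitor(db: Arc<Db>, cancel: CancelFlag) -> JoinHandle<()> {
    thread::spawn(move || {
        while !cancel.is_cancelled() {
            let _keep_alive = &db; // a real monitor would read from the DB here
            thread::sleep(Duration::from_millis(5));
        }
        // `db` is dropped when the closure returns, releasing this reference.
    })
}

/// Cancels the monitors, waits for them to finish, and returns how many
/// strong references to the database remain (1 means only the caller's).
pub fn shutdown_monitors(
    db: &Arc<Db>,
    cancel: &CancelFlag,
    monitors: Vec<JoinHandle<()>>,
) -> usize {
    cancel.cancel();
    for monitor in monitors {
        monitor.join().expect("monitor thread panicked");
    }
    Arc::strong_count(db)
}
```

With three monitors running, the strong count is 4; after `shutdown_monitors` returns it is 1, which is the condition a "block until all instances are dropped" call would be waiting for.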
