fix(node): handle SIGTERM for graceful shutdown on operator stop #3410
barakeinav1 wants to merge 1 commit into
mpc-node had no SIGTERM handler installed, so SIGTERM from a container/orchestrator stop (dstack CVM stop, docker stop, kubectl delete, systemctl stop) was effectively SIGKILL: the OS terminated the process immediately, leaving the embedded near-indexer thread to be killed mid-write. The next start could then trip the nearcore restart panic documented in docs/investigation/.

This installs a SIGTERM handler that routes into the existing shutdown_signal channel, so the same select! arm that handles TEE image-hash-initiated shutdowns also handles SIGTERM. After the main loop exits we call near_async::shutdown_all_actors() so nearcore's actor system can commit any in-flight RocksDB batches before we return.

We deliberately do NOT call RocksDB::block_until_all_instances_are_dropped() (which neard's standalone binary does next). Our embedded indexer runs in a std::thread::spawn'd closure whose block_on never returns because spawned monitor tasks hold Arc<IndexerState> → Arc<RocksDB> refs that nothing cancels. Calling block_until_all_instances_are_dropped would hang the SIGTERM path past any reasonable grace period and end in the SIGKILL we were trying to avoid. RocksDB's WAL still guarantees committed data survives kill; this just closes a smaller flush window. See the inline comment.

Closes #3409. Does NOT close the upstream nearcore restart panic at streamer/mod.rs:207, which fires non-deterministically regardless of shutdown cleanliness; see docs/investigation/nearcore-indexer-sigkill-restart-panic.md.
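To make the hang concrete, here is a minimal sketch of why waiting for every database reference to disappear cannot finish while some task still holds one. `Db` and `wait_until_dropped` are hypothetical names invented for this illustration; they are not nearcore or RocksDB APIs, and the real `block_until_all_instances_are_dropped` has no deadline at all.

```rust
use std::sync::Weak;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the embedded database handle.
pub struct Db;

/// Polls until no strong reference to `db` remains or `grace` elapses.
/// Returns `true` if every reference was dropped in time.
pub fn wait_until_dropped(db: &Weak<Db>, grace: Duration) -> bool {
    let deadline = Instant::now() + grace;
    while db.strong_count() > 0 {
        if Instant::now() >= deadline {
            // Someone still holds a reference; an unbounded wait would hang here.
            return false;
        }
        thread::sleep(Duration::from_millis(5));
    }
    true
}
```

With a lingering `Arc<Db>` clone alive, the wait only ends because of the deadline. Without a deadline it would block until the orchestrator's grace period expires, which is the SIGKILL outcome described above.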
Pull request overview
Installs a Unix SIGTERM handler in mpc-node that forwards SIGTERM into the existing internal shutdown channel so operator-initiated stops (Docker/Kubernetes/systemd/dstack) get the same graceful path used by TEE image-hash-driven shutdowns, and calls near_async::shutdown_all_actors() after the main select! exits to give nearcore's actor system a chance to flush in-flight RocksDB batches.
Changes:
- Spawn a task on `root_runtime` that installs a SIGTERM handler and sends `()` on `shutdown_signal_sender` when received.
- After graceful shutdown, invoke `near_async::shutdown_all_actors()` (with an inline rationale for skipping `RocksDB::block_until_all_instances_are_dropped()`).
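The two changes above can be sketched with standard-library threads and channels. This is an illustration under stated assumptions, not the PR's code: the PR spawns an async task on `root_runtime` and waits in a `select!`, whereas here `wait_for_sigterm` is a hypothetical blocking closure standing in for the signal wait, and `stop_actors` stands in for `near_async::shutdown_all_actors()`.

```rust
use std::sync::mpsc::{Receiver, Sender};
use std::thread::{self, JoinHandle};

/// Spawns a forwarder that blocks until the (hypothetical) SIGTERM wait
/// returns, then sends `()` on the shared shutdown channel.
pub fn spawn_sigterm_forwarder<F>(
    wait_for_sigterm: F,
    shutdown_signal_sender: Sender<()>,
) -> JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    thread::spawn(move || {
        wait_for_sigterm();
        // A send error means the receiver is gone, i.e. shutdown already started.
        let _ = shutdown_signal_sender.send(());
    })
}

/// Stand-in for the main loop: blocks until any source requests shutdown,
/// then runs the actor-system stop hook. Returns `true` on a requested shutdown.
pub fn run_until_shutdown(shutdown_signal: Receiver<()>, stop_actors: impl FnOnce()) -> bool {
    let requested = shutdown_signal.recv().is_ok();
    if requested {
        stop_actors();
    }
    requested
}
```

Because the forwarder only needs a `Sender<()>`, any other shutdown source (such as the image-hash watcher) can hold its own clone of the sender and reach the same receive arm.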
✅ Approved
…tdown question

The earlier docs had to hedge: SIGTERM experiments couldn't distinguish 'graceful shutdown doesn't help' from 'we never had a SIGTERM handler.' We've now closed that gap by landing a real SIGTERM handler (#3409 / #3410) and re-running the back-migration campaign.

- 2121-back-migration-e2e-flake.md: new section 'Real SIGTERM handler in mpc-node — also does not fix it' covering iteration 1 (handler hangs on block_until_all_instances_are_dropped, 5/5 SIGKILL fallback) and iteration 2 (drop the block_until call, 1/5 pass with 100 ms graceful shutdowns in 5/5). Follow-up #5 updated: in flight via #3409 / #3410.
- nearcore-indexer-sigkill-restart-panic.md: new 'Strongest evidence: graceful shutdown doesn't help' subsection inside 'What we ruled out'; table updated with the 5-run SIGTERM-handler row (4/5 fail at 100 ms); TL;DR + Workarounds + References updated; old SIGTERM-disclaimer removed since the question is now answered.

Together the docs now say: same panic, same rate, regardless of whether shutdown was SIGKILL or a verified 100 ms graceful actor-system stop. The fix has to be in nearcore.
Summary
- Installs a SIGTERM handler in `mpc-node` that routes the signal into the existing internal shutdown channel. Operators stopping the node via dstack CVM stop / `docker stop` / `kubectl delete` / `systemctl stop` now get graceful shutdown semantics, where before SIGTERM was effectively SIGKILL.
- After the main `select!` exits, calls `near_async::shutdown_all_actors()` so nearcore's actor system can commit any in-flight RocksDB batches before the process exits.

Closes #3409. Does NOT close the upstream nearcore restart panic at `streamer/mod.rs:207`; see `docs/investigation/nearcore-indexer-sigkill-restart-panic.md`. Test campaign data on the investigation branch showed this handler gets back-migration e2e from 0/5 pass to 1/5: a real production improvement, but not a full fix for the upstream non-determinism.

Notable non-action: no `RocksDB::block_until_all_instances_are_dropped()`

neard's standalone binary follows `shutdown_all_actors()` with `RocksDB::block_until_all_instances_are_dropped()`. We deliberately don't, because in our embedding it hangs indefinitely: the indexer runs in a `std::thread::spawn`'d closure whose `block_on` never returns, since the spawned `monitor_*` / `indexer_logger` tasks hold `Arc<IndexerState>` → `Arc<RocksDB>` references that nothing currently cancels on shutdown. Calling `block_until_all_instances_are_dropped` would block the SIGTERM path past any reasonable grace period and the orchestrator would SIGKILL us anyway. A proper fix would wire a `CancellationToken` through the indexer thread; out of scope for this PR. The inline comment captures the rationale.

Test plan
- Run `mpc-node` locally against localnet, send `kill -TERM <pid>`, observe the `SIGTERM received, initiating graceful shutdown` log followed by `Stopping nearcore actor system.` and a clean exit (typically <1s).
- … (`should_handle_back_migration_a_to_b_to_a` flake is pre-existing and tracks the upstream bug, not this handler).
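As a rough illustration of the out-of-scope fix named under "Notable non-action", the sketch below shows a monitor-style worker releasing its reference once a cancellation flag is set. Everything here is an assumption made for illustration: `IndexerState` is an empty stand-in, `spawn_monitor` and `stop_monitor` are hypothetical helpers, and an `AtomicBool` plays the role that a `CancellationToken` would play in async code.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the shared state that owns the database handle.
pub struct IndexerState;

/// Spawns a worker that keeps a strong reference to `state` until `cancel` is set.
pub fn spawn_monitor(state: Arc<IndexerState>, cancel: Arc<AtomicBool>) -> JoinHandle<()> {
    thread::spawn(move || {
        while !cancel.load(Ordering::SeqCst) {
            // Periodic monitoring work would go here.
            thread::sleep(Duration::from_millis(5));
        }
        // Dropping the reference lets the owner observe the release.
        drop(state);
    })
}

/// Requests cancellation and waits for the worker to exit.
pub fn stop_monitor(cancel: &AtomicBool, worker: JoinHandle<()>) {
    cancel.store(true, Ordering::SeqCst);
    worker.join().expect("monitor thread panicked");
}
```

Once every such worker observes the flag and exits, the strong count falls back to the owner's single reference, which is the precondition a wait-for-drop call would need in order to return.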