fix: fix empty block-nodes.json race in parallel deploy #4647
JeffreyDallas wants to merge 16 commits into
…deploy
When parallelDeploy is true, the one-shot orchestrator runs
'Deploy block node' and 'Deploy network node' (including node setup)
concurrently. node setup's updateBlockNodesJson() was reading blockNodeMap
before block-node add's updateConsensusNodesInRemoteConfig() had a chance
to populate it, resulting in block-nodes.json being written as {"nodes":[]}.
Fix:
- Add BlockNodeDeployed event type and BlockNodeDeployedEvent class
- Emit BlockNodeDeployedEvent from block-node add after
handleConsensusNodeUpdating() persists the updated blockNodeMap
- Emit the event from the skip callback when block-node deploy is
skipped (not needed or already deployed), so the wait never hangs
- Add withWaitCondition(BlockNodeDeployed) to the "Setup consensus node"
phase so updateBlockNodesJson() only runs after the blockNodeMap is
populated; waitFor checks history so sequential mode has zero extra latency
Also add BlockNodesJsonEmptySoloError (SOLO-5074) as an early-fail
guard in createAndCopyBlockNodeJsonFileForConsensusNode so any future
empty-nodes case surfaces as a clear error instead of silent bad config.
Fixes #4644
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
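
The event gate described in this commit can be sketched roughly as follows. This is a minimal illustration, not Solo's actual event API: `EventBus`, `hasFired`, and `deployBlockNode` are hypothetical names, and the timeout handling is an assumption. It shows the two properties the commit message relies on: `waitFor` checks history first (so sequential mode does not wait), and the skip path still emits the event (so the wait never hangs).

```typescript
// Hypothetical sketch; none of these names are taken from the Solo codebase
// except the event name "BlockNodeDeployed".
type EventType = "BlockNodeDeployed";

class EventBus {
  private readonly history = new Set<EventType>();
  private readonly waiters = new Map<EventType, Array<() => void>>();

  // Record the event, then release every waiter currently parked on it.
  emit(type: EventType): void {
    this.history.add(type);
    const pending = this.waiters.get(type) ?? [];
    this.waiters.delete(type);
    for (const release of pending) release();
  }

  hasFired(type: EventType): boolean {
    return this.history.has(type);
  }

  // Resolves immediately if the event is already in history; otherwise
  // waits until emit() is called or rejects once timeoutMs has elapsed.
  waitFor(type: EventType, timeoutMs: number): Promise<void> {
    if (this.history.has(type)) return Promise.resolve();
    return new Promise<void>((resolve, reject) => {
      const timer = setTimeout(
        () => reject(new Error(`Timed out waiting for ${type}`)),
        timeoutMs,
      );
      const list = this.waiters.get(type) ?? [];
      list.push(() => {
        clearTimeout(timer);
        resolve();
      });
      this.waiters.set(type, list);
    });
  }
}

// Emits on both paths: after persisting when deploy is needed, and from the
// skip path when it is not, so a gated phase is always released.
function deployBlockNode(bus: EventBus, needed: boolean, persist: () => void): void {
  if (needed) persist();
  bus.emit("BlockNodeDeployed");
}
```

A gated phase would then `await bus.waitFor("BlockNodeDeployed", timeoutMs)` before reading the map; in sequential mode the event is already in history, so the call resolves without delay.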
❌ This pull request failed tests. It has been removed from the merge queue. PR #4737 was used for testing.
Unit Test Results - Linux: 38 tests, 38 ✅, 0s ⏱️. Results for commit a7967cc.
Unit Test Results - Windows: 1 files, 337 suites, 10s ⏱️. Results for commit a7967cc.
The early-fail guard in createAndCopyBlockNodeJsonFileForConsensusNode was
too strict: it fired during block node destroy when
rebuildBlockNodesJsonForConsensusNodes() intentionally writes an empty file
after the last block node is removed from blockNodeMap.
Add allowEmpty parameter (default false) so callers in the destroy path can
opt out of the guard while all deploy/setup callers still get the fail-fast
protection.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
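
A minimal sketch of the guard with the opt-out, under stated assumptions: `buildBlockNodesJson`, `BlockNodeEntry`, and `BlockNodesJsonEmptyError` are hypothetical stand-ins for the real function and error class, and the entry fields are invented for illustration. Only the `allowEmpty` default of `false` and the `{"nodes":[]}` shape come from the PR text.

```typescript
// Hypothetical entry shape; the real block-nodes.json fields may differ.
interface BlockNodeEntry {
  address: string;
  port: number;
}

// Stand-in for BlockNodesJsonEmptySoloError (SOLO-5074).
class BlockNodesJsonEmptyError extends Error {
  constructor() {
    super('block-nodes.json would be written as {"nodes":[]}');
    this.name = "BlockNodesJsonEmptyError";
  }
}

// Deploy/setup callers use the default and fail fast on an empty list;
// the destroy path passes allowEmpty = true to write the empty file on purpose.
function buildBlockNodesJson(nodes: BlockNodeEntry[], allowEmpty = false): string {
  if (nodes.length === 0 && !allowEmpty) {
    throw new BlockNodesJsonEmptyError();
  }
  return JSON.stringify({ nodes });
}
```

Defaulting `allowEmpty` to `false` keeps every existing caller protected; only the destroy path has to change.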
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
E2E Test Report: 10 files, 94 suites, 1h 24m 24s ⏱️. Results for commit a7967cc.
jan-milenkov left a comment
If this has a negative effect on performance, inform the managers before proceeding with the merge.
The "Copy block-nodes.json" step inside `consensus network deploy` reads
blockNodeMap from the consensus node state. In parallel one-shot deploy,
block-node add registers the component before it populates blockNodeMap,
creating a window where consensus network deploy sees a non-empty
blockNodeComponents list but an empty blockNodeMap, causing the guard to
throw BlockNodesJsonEmptySoloError.
Move the withWaitCondition(BlockNodeDeployed) gate from "Setup consensus
node" to "Deploy consensus node" so consensus network deploy only starts
after block-node add has fully persisted blockNodeMap. Since block-node add
(~30s) completes well before consensus network deploy finishes, there is no
real latency cost.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
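
The window described in this commit can be modelled with a small deterministic sketch. The field names mirror the PR text, but the shapes and functions are hypothetical: `blockNodeMap` is simplified to a list of ids, and a boolean stands in for the `BlockNodeDeployed` event.

```typescript
// Hypothetical model; not Solo's real remote-config types.
interface RemoteState {
  blockNodeComponents: number[]; // registered block node ids
  blockNodeMap: number[];        // block node ids mapped to the consensus node
  blockNodeDeployed: boolean;    // stands in for the BlockNodeDeployed event
}

// Step 1 of block-node add: the component is registered first.
function registerComponent(state: RemoteState, id: number): void {
  state.blockNodeComponents.push(id);
}

// Step 2, some time later: the map is persisted and the event is emitted.
function persistMapAndEmit(state: RemoteState, id: number): void {
  state.blockNodeMap.push(id);
  state.blockNodeDeployed = true;
}

// "Copy block-nodes.json": fails if components exist but the map is empty.
function copyBlockNodesJson(state: RemoteState): string {
  if (state.blockNodeComponents.length > 0 && state.blockNodeMap.length === 0) {
    throw new Error("block-nodes.json would be empty");
  }
  return JSON.stringify({ nodes: state.blockNodeMap });
}

// Gate on "Deploy consensus node": do not start until the event has fired.
function canStartConsensusDeploy(state: RemoteState): boolean {
  return state.blockNodeDeployed;
}
```

Between the two steps the copy would throw, and the gate reports `false`; once the second step has run, the gate opens and the copy succeeds.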
No negative effect on one-shot timing, performance, or memory footprint.
block-nodes.json race in parallel deploy
…k-nodes-json-empty-parallel-deploy
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
# Conflicts:
#   src/core/helpers.ts
…k-nodes-json-empty-parallel-deploy
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
# Conflicts:
#   src/commands/block-node.ts
#   src/commands/one-shot/orchestrator/deploy/default-one-shot-deploy-orchestrator.ts
#   src/core/helpers.ts
The relay JSON-RPC pod running under 250Mi is prone to OOM under
Windows/WSL2 CI resource pressure. An OOM kill triggers a restart that
creates memory/CPU churn, which delays mirror-node postgres initialisation
and causes the mirror-grpc readiness check to time out.
Raise the request to 350Mi and the limit to 700Mi to prevent OOM and give
the relay enough headroom to start cleanly alongside the other one-shot
components.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Waiting on #4798 to fix the migration test.
Summary
Fixes #4644
When `parallelDeploy: true`, the one-shot orchestrator runs `Deploy block node` and `Deploy network node` (including `node setup`) concurrently. `node setup`'s `updateBlockNodesJson()` was reading `blockNodeMap` before `block-node add`'s `updateConsensusNodesInRemoteConfig()` had populated it, so `block-nodes.json` landed on the consensus node as `{"nodes":[]}`. The node started without any block node connections.

Root cause (confirmed from `solo.log`):
- `block-node add` helm-installs the block node (~20–30 s) and only calls `updateConsensusNodesInRemoteConfig()` after the pod is healthy
- `node setup` starts, loads `remoteConfig`, and runs `updateBlockNodesJson()` ~8 s before the block node pod is even ready
- `block-node add` skips re-copying `block-nodes.json` because `ledgerPhase === UNINITIALIZED`
- … `block-nodes.json` on the node

Fix (event-gate approach, no change to `parallelDeploy`):
- `BlockNodeDeployed` event type + `BlockNodeDeployedEvent` class
- `block-node add` emits the event after `handleConsensusNodeUpdating()` finishes (blockNodeMap persisted)
- `withWaitCondition(SoloEventType.BlockNodeDeployed, 10 min)` added to the "Setup consensus node" phase; `updateBlockNodesJson()` now only runs after the blockNodeMap is guaranteed populated

Also adds `BlockNodesJsonEmptySoloError` (SOLO-5074) as an early-fail guard in `createAndCopyBlockNodeJsonFileForConsensusNode`: if the generated `nodes` array is empty, the function throws immediately with actionable troubleshooting steps instead of silently deploying broken config.

Test plan
- `task test-e2e-standard`: existing one-shot test suite
- `parallelDeploy: true` and confirm `block-nodes.json` contains node entries on the consensus node
- `parallelDeploy: false` (sequential) and confirm no regression / no extra wait time

🤖 Generated with Claude Code