Skip to content

fix: fix empty block-nodes.json race in parallel deploy#4647

Open
JeffreyDallas wants to merge 16 commits into
mainfrom
fix/block-nodes-json-empty-parallel-deploy
Open

fix: fix empty block-nodes.json race in parallel deploy#4647
JeffreyDallas wants to merge 16 commits into
mainfrom
fix/block-nodes-json-empty-parallel-deploy

Conversation

@JeffreyDallas

Copy link
Copy Markdown
Contributor

Summary

Fixes #4644

When parallelDeploy: true, the one-shot orchestrator runs Deploy block node and Deploy network node (including node setup) concurrently. node setup's updateBlockNodesJson() was reading blockNodeMap before block-node add's updateConsensusNodesInRemoteConfig() had populated it, so block-nodes.json landed on the consensus node as {"nodes":[]}. The node started without any block node connections.

Root cause (confirmed from solo.log):

  • block-node add helm-installs the block node (~20–30 s) and only calls updateConsensusNodesInRemoteConfig() after the pod is healthy
  • Concurrently, node setup starts, loads remoteConfig, and runs updateBlockNodesJson() ~8 s before the block node pod is even ready
  • block-node add skips re-copying block-nodes.json because ledgerPhase === UNINITIALIZED
  • Result: permanent empty block-nodes.json on the node

Fix — event-gate approach (no change to parallelDeploy):

  • New BlockNodeDeployed event type + BlockNodeDeployedEvent class
  • block-node add emits the event after handleConsensusNodeUpdating() finishes (blockNodeMap persisted)
  • The skip callback in the one-shot orchestrator emits the event immediately when block-node deploy is skipped (not needed / already deployed), so the gate resolves from history and there is zero extra latency on sequential or retry paths
  • withWaitCondition(SoloEventType.BlockNodeDeployed, 10 min) added to the "Setup consensus node" phase — updateBlockNodesJson() now only runs after the blockNodeMap is guaranteed populated

Also adds BlockNodesJsonEmptySoloError (SOLO-5074) as an early-fail guard in createAndCopyBlockNodeJsonFileForConsensusNode: if the generated nodes array is empty the function throws immediately with actionable troubleshooting steps instead of silently deploying broken config.

Test plan

  • Run task test-e2e-standard — existing one-shot test suite
  • Run one-shot with parallelDeploy: true and confirm block-nodes.json contains node entries on the consensus node
  • Run one-shot with parallelDeploy: false (sequential) and confirm no regression / no extra wait time
  • Retry scenario: run deploy → destroy → deploy again and confirm block-nodes.json is correct on second deploy

🤖 Generated with Claude Code

…deploy

When parallelDeploy is true, the one-shot orchestrator runs
'Deploy block node' and 'Deploy network node' (including node setup)
concurrently. node setup's updateBlockNodesJson() was reading blockNodeMap
before block-node add's updateConsensusNodesInRemoteConfig() had a chance
to populate it, resulting in block-nodes.json being written as {"nodes":[]}.

Fix:
- Add BlockNodeDeployed event type and BlockNodeDeployedEvent class
- Emit BlockNodeDeployedEvent from block-node add after
  handleConsensusNodeUpdating() persists the updated blockNodeMap
- Emit the event from the skip callback when block-node deploy is
  skipped (not needed or already deployed), so the wait never hangs
- Add withWaitCondition(BlockNodeDeployed) to the "Setup consensus node"
  phase so updateBlockNodesJson() only runs after the blockNodeMap is
  populated; waitFor checks history so sequential mode has zero extra latency

Also add BlockNodesJsonEmptySoloError (SOLO-5074) as an early-fail
guard in createAndCopyBlockNodeJsonFileForConsensusNode so any future
empty-nodes case surfaces as a clear error instead of silent bad config.

Fixes #4644

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas requested a review from a team as a code owner June 12, 2026 20:33
@trunk-io

trunk-io Bot commented Jun 12, 2026

Copy link
Copy Markdown

❌ This pull request failed tests. It has been removed from the merge queue. PR #4737 was used for testing. See more details here.

Failed Required Status Conclusion
Windows Deploy and Destroy Failure
  • To merge this pull request, check the box to the left or comment /trunk merge below.

After your PR is submitted to the merge queue, this comment will be automatically updated with its status. If the PR fails, failure details will also be posted here

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Unit Test Results - Linux

38 tests   38 ✅  0s ⏱️
17 suites   0 💤
 1 files     0 ❌

Results for commit a7967cc.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Unit Test Results - Windows

    1 files    337 suites   10s ⏱️
1 070 tests 1 070 ✅ 0 💤 0 ❌
1 074 runs  1 074 ✅ 0 💤 0 ❌

Results for commit a7967cc.

♻️ This comment has been updated with latest results.

@jan-milenkov jan-milenkov added PR: Needs Team Approval A pull request that needs review from a team member. PR: Needs Manager Approval A pull request that needs review from a manager. PR: Checks Failed A pull request where the checks have failed. and removed PR: Needs Team Approval A pull request that needs review from a team member. PR: Needs Manager Approval A pull request that needs review from a manager. labels Jun 15, 2026
JeffreyDallas and others added 4 commits June 15, 2026 11:30
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
The early-fail guard in createAndCopyBlockNodeJsonFileForConsensusNode
was too strict: it fired during block node destroy when
rebuildBlockNodesJsonForConsensusNodes() intentionally writes an empty
file after the last block node is removed from blockNodeMap.

Add allowEmpty parameter (default false) so callers in the destroy
path can opt out of the guard while all deploy/setup callers still
get the fail-fast protection.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@github-actions

Copy link
Copy Markdown
Contributor

E2E Test Report

 10 files  ±0   94 suites  ±0   1h 26m 14s ⏱️ + 1m 28s
301 tests ±0  301 ✅ ±0  0 💤 ±0  0 ❌ ±0 
320 runs  ±0  320 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 90b231a. ± Comparison against base commit 3487dc2.

@github-actions

github-actions Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

E2E Test Report

 10 files   94 suites   1h 24m 24s ⏱️
302 tests 302 ✅ 0 💤 0 ❌
321 runs  321 ✅ 0 💤 0 ❌

Results for commit a7967cc.

♻️ This comment has been updated with latest results.

@JeffreyDallas JeffreyDallas added PR: Needs Team Approval A pull request that needs review from a team member. P1-💎 Current Milestone & Goals and removed PR: Checks Failed A pull request where the checks have failed. labels Jun 15, 2026
@JeffreyDallas JeffreyDallas changed the title fix(one-shot): fix empty block-nodes.json race in parallel deploy fix: fix empty block-nodes.json race in parallel deploy Jun 15, 2026
jan-milenkov
jan-milenkov previously approved these changes Jun 16, 2026

@jan-milenkov jan-milenkov left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this has negative effect on performance, inform managers before proceeding with merge

@jan-milenkov jan-milenkov added PR: Unresolved Comments A pull request where there are comments and they need to be resolved. and removed PR: Needs Team Approval A pull request that needs review from a team member. labels Jun 16, 2026
The "Copy block-nodes.json" step inside `consensus network deploy` reads
blockNodeMap from the consensus node state. In parallel one-shot deploy,
block-node add registers the component before it populates blockNodeMap,
creating a window where consensus network deploy sees a non-empty
blockNodeComponents list but an empty blockNodeMap — causing the guard
to throw BlockNodesJsonEmptySoloError.

Move the withWaitCondition(BlockNodeDeployed) gate from "Setup consensus
node" to "Deploy consensus node" so consensus network deploy only starts
after block-node add has fully persisted blockNodeMap. Since block-node
add (~30s) completes well before consensus network deploy finishes,
there is no real latency cost.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas

Copy link
Copy Markdown
Contributor Author

If this has negative effect on performance, inform managers before proceeding with merge

no negative effect on oneshot timing or performance memory foot print

@JeffreyDallas JeffreyDallas removed the PR: Unresolved Comments A pull request where there are comments and they need to be resolved. label Jun 16, 2026
@JeffreyDallas JeffreyDallas removed the PR: Checks Failed A pull request where the checks have failed. label Jun 17, 2026
@jan-milenkov jan-milenkov changed the title fix: fix empty block-nodes.json race in parallel deploy fix: fix empty block-nodes.json race in parallel deploy Jun 18, 2026
jan-milenkov
jan-milenkov previously approved these changes Jun 18, 2026
@jan-milenkov jan-milenkov added the PR: Ready to Merge A pull request that is ready to merge. label Jun 18, 2026
@jeromy-cannon jeromy-cannon marked this pull request as draft June 19, 2026 15:49
@jeromy-cannon jeromy-cannon added PR: Merge Conflicts A pull request that has merge conflicts that need to be resolved. and removed PR: Needs Team Approval A pull request that needs review from a team member. PR: Ready to Merge A pull request that is ready to merge. labels Jun 19, 2026
…k-nodes-json-empty-parallel-deploy

Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>

# Conflicts:
#	src/core/helpers.ts
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas removed the PR: Merge Conflicts A pull request that has merge conflicts that need to be resolved. label Jun 22, 2026
@JeffreyDallas JeffreyDallas marked this pull request as ready for review June 22, 2026 20:29
@JeffreyDallas JeffreyDallas added the PR: Needs Team Approval A pull request that needs review from a team member. label Jun 23, 2026
@jeromy-cannon jeromy-cannon marked this pull request as draft June 23, 2026 06:05
@jeromy-cannon jeromy-cannon added PR: Merge Conflicts A pull request that has merge conflicts that need to be resolved. and removed PR: Needs Team Approval A pull request that needs review from a team member. labels Jun 23, 2026
JeffreyDallas and others added 2 commits June 23, 2026 08:54
…k-nodes-json-empty-parallel-deploy

Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>

# Conflicts:
#	src/commands/block-node.ts
#	src/commands/one-shot/orchestrator/deploy/default-one-shot-deploy-orchestrator.ts
#	src/core/helpers.ts
@JeffreyDallas JeffreyDallas removed the PR: Merge Conflicts A pull request that has merge conflicts that need to be resolved. label Jun 23, 2026
@JeffreyDallas JeffreyDallas marked this pull request as ready for review June 23, 2026 14:46
@jeromy-cannon jeromy-cannon added the PR: Checks Failed A pull request where the checks have failed. label Jun 23, 2026
@jeromy-cannon jeromy-cannon marked this pull request as draft June 23, 2026 15:29
@JeffreyDallas JeffreyDallas marked this pull request as ready for review June 23, 2026 16:16
JeffreyDallas and others added 2 commits June 23, 2026 11:16
The relay JSON-RPC pod running under 250Mi is prone to OOM under
Windows/WSL2 CI resource pressure. An OOM kill triggers a restart that
creates memory/CPU churn, which delays mirror-node postgres initialisation
and causes the mirror-grpc readiness check to time out.

Raise the request to 350Mi and the limit to 700Mi to prevent OOM and
give the relay enough headroom to start cleanly alongside the other
one-shot components.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas added Blocked Further development work is blocked by other item and removed PR: Checks Failed A pull request where the checks have failed. labels Jun 23, 2026
@JeffreyDallas

Copy link
Copy Markdown
Contributor Author

Wait #4798 to fix migration test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Blocked Further development work is blocked by other item P1-💎 Current Milestone & Goals

Projects

None yet

Development

Successfully merging this pull request may close these issues.

oneshot failed get relay pod ready

3 participants