Background
e2e-tests::e2e migration_service::migration_service__should_handle_back_migration_a_to_b_to_a (added in #3346) has been flaking on CI. Three confirmed failures share an identical signature: Connection refused on http://127.0.0.1:21725/debug/migrations at migration_service.rs:533:29.
Root cause: cluster.wait_for_node_healthy() only checks that /health returns 200 OK. The /health handler is || async { "OK" } (crates/node/src/web.rs) and is bound very early in node startup, before the indexer initializes. After the kill+restart in the back-migration test, the function returns in as little as 16 ms (the time to spawn the binary and bind a socket), while the indexer is still warming up. The test then enters the back-migration's polling loop on /debug/migrations. If A0's process exits during catch-up (likely panicking on the contract state after the forward migration's identity swap), the polling sits on Connection refused for the full 30 s INDEXER_SYNC_TIMEOUT.
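One way to close the gap described above is to gate on observed indexer progress rather than on /health alone. The sketch below is not the actual helper: wait_for_indexer_progress, GateError, and the probe closure are hypothetical names introduced for illustration. The probe stands in for "read the indexer block height from the node" and returns None when the node is unreachable or exposes no height; how the real test obtains that value is an assumption left open here.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Why the readiness wait gave up (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum GateError {
    /// Every probe failed, e.g. connection refused or metric missing.
    MetricUnavailable,
    /// The height was readable but stayed at this value until the deadline.
    NoProgress(u64),
}

/// Polls `probe` until the block height it reports rises above the first
/// height observed, or until `timeout` elapses.
fn wait_for_indexer_progress<F>(
    mut probe: F,
    timeout: Duration,
    interval: Duration,
) -> Result<u64, GateError>
where
    F: FnMut() -> Option<u64>,
{
    let deadline = Instant::now() + timeout;
    let mut baseline: Option<u64> = None;
    loop {
        if let Some(height) = probe() {
            match baseline {
                // First successful read: remember it and keep polling.
                None => baseline = Some(height),
                // Height moved forward: the indexer is making progress.
                Some(first) if height > first => return Ok(height),
                // Same (or lower) height: not yet evidence of progress.
                Some(_) => {}
            }
        }
        if Instant::now() >= deadline {
            // Distinguish "never saw the metric" from "saw it, but it stalled".
            return Err(match baseline {
                None => GateError::MetricUnavailable,
                Some(height) => GateError::NoProgress(height),
            });
        }
        sleep(interval);
    }
}
```

Requiring two distinct heights, rather than a single successful read, is what separates "socket is bound" from "indexer is advancing". It assumes the test cluster keeps producing blocks; on an idle chain this gate would time out even on a healthy node.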
User Story
As a developer, I want migration_service__should_handle_back_migration_a_to_b_to_a to pass deterministically so flaky CI doesn't block unrelated PRs.
Acceptance Criteria
The test passes on at least 15 consecutive CI runs on the same branch.
The fix targets the readiness gap (the test must not proceed before the indexer is making progress); it is not a band-aid on the polling timeout.
Resources & Additional Notes
If the secondary "A0 crashes during catch-up" theory ever materializes after this fix, the new helper's error message will say "indexer block-height metric not available — node may have exited" (instead of the silent 30 s Connection refused timeout). A future change to capture each MPC node's stderr.log into the test output on failure would close the diagnostic loop for that case.
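The stderr capture mentioned above could look roughly like the following. This is a sketch under assumptions, not an existing test-harness API: tail_lines and failure_report are hypothetical helpers, and the location of each node's stderr.log is left to the caller.

```rust
use std::fs;
use std::path::Path;

/// Returns the last `n` lines of `text`, in their original order.
fn tail_lines(text: &str, n: usize) -> Vec<&str> {
    let lines: Vec<&str> = text.lines().collect();
    let start = lines.len().saturating_sub(n);
    lines[start..].to_vec()
}

/// Builds a failure report that appends the tail of a node's stderr log.
/// An unreadable log is reported inline instead of panicking, so the
/// original test failure is never masked by a secondary I/O error.
fn failure_report(context: &str, stderr_log: &Path, n: usize) -> String {
    match fs::read(stderr_log) {
        Ok(bytes) => {
            // Lossy decoding keeps the report usable if the log has bad UTF-8.
            let text = String::from_utf8_lossy(&bytes);
            format!(
                "{context}\n--- last {n} lines of {} ---\n{}",
                stderr_log.display(),
                tail_lines(&text, n).join("\n")
            )
        }
        Err(err) => format!(
            "{context}\n(could not read {}: {err})",
            stderr_log.display()
        ),
    }
}
```

A test could pass the resulting string to panic! when the readiness wait fails, so a node that exited during catch-up leaves its last log lines in the CI output.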