DO NOT REVIEW - Fix CI#104
Conversation
Companion to the bounded shutdown waits: the wait_for_{bitcoind,wallet,zainod}_start
loops were unbounded `while True`, so a process that starts but never finishes
initialization (a stuck sync, or a port still held by an orphan from a prior
aborted run) would spin forever and hang the run until CI's multi-hour limit —
the same failure mode the shutdown bound fixes, on the startup side.
Add a PROC_START_TIMEOUT (default 120s, env-overridable) deadline to all three
startup loops so they fail fast and surface the cause instead of hanging.
Approach grafted from #104 (dannywillems), consolidated here so #102 is the
single home for the process-lifecycle bounding.
Co-Authored-By: Danny Willems <be.danny.willems@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for digging into this @dannywillems — the process-lifecycle hardening here is genuinely valuable, and the per-test analysis of why each test can't run is excellent. I validated things end-to-end; below is the full picture and a suggested way to split the work, since this PR overlaps a couple of others. Treat the breakdown as a proposal — happy to adjust if you see it differently. The CI red looks like an upstream blocker, not anything in this diff. The required shards fail in That's zcash/wallet#454 (fixed by zcash/wallet#455, not yet merged). CI builds stock zallet, so the cache rebuild crashes before any test in the list runs — so disabling tests can't get the rollup green on its own. Nothing on the integration-tests side goes green until #455 lands. This PR is effectively three changes, and two overlap existing PRs, so there's a clean opportunity to split it:
How it fits together: #102 (hardening) + #101 (matrix mechanism) + this PR (the One small thing: the "Fix CI" title undersells the work — maybe "test framework: disable unmigrated RPC tests + migrate mempool_nu_activation to ZebraArgs" once part 3 is split out. |
|
Correction to my comment above — I'd reviewed an earlier push and missed your three latest commits (the NU5-activation work). They change the picture: Retracting the "nothing goes green until zcash/wallet#455" claim. Your The remaining 0/13 is a config asymmetry the NU5 change introduced — and it's a one-line-ish fix. The cache is always built with NU5 ( Reproduced locally: a clean-cache def setup_nodes(self):
args = [
ZebraArgs(
miner_address=(self.miner_addresses[i] if self.miner_addresses else None),
activation_heights=dict(REGTEST_ACTIVATION_HEIGHTS),
)
for i in range(self.num_nodes)
]
return start_nodes(self.num_nodes, self.options.tmpdir, args)(Any test overriding Minor: your bounded startup waits ( The earlier layering suggestion still stands (disable on #101's per-shard mechanism; #102 as the home for the hardening), but your NU5 fix is the headline and should land regardless. |
|
This is a WIP, no need to review for now. |
| '-debug=mempool', | ||
| '-blockmaxsize=4000', | ||
| '-preferredtxversion=4', | ||
| '-allowdeprecated=getnewaddress', |
There was a problem hiding this comment.
I do believe it is expected to not be implemented in Zebra, as Zebra is a consensus node, and not a wallet.
Same argument for:
-allowdeprecated=z_getbalance
-allowdeprecated=z_getnewaddress
| '-checkmempool', | ||
| '-debug=mempool', | ||
| '-blockmaxsize=4000', | ||
| '-preferredtxversion=4', |
There was a problem hiding this comment.
From Claude, by looking at Zebra code.
No config or flag for preferred tx version. Tx versions follow the active network upgrade / protocol rules.
| args = [ | ||
| '-checkmempool', | ||
| '-debug=mempool', | ||
| '-blockmaxsize=4000', |
There was a problem hiding this comment.
From Claude, by looking at Zebra code.
Max block size is the hard-coded consensus constant MAX_BLOCK_BYTES = 2_000_000 (zebra-chain/src/block/serialize.rs:24). Not configurable.
| def setup_network(self): | ||
| args = [ | ||
| '-checkmempool', | ||
| '-debug=mempool', |
There was a problem hiding this comment.
From Claude, by looking at Zebra code.
Zebra uses tracing filters, not -debug= flags. Equivalent: set a tracing filter like zebra_mempool=debug (via the [tracing] config or the --filters arg / RUST_LOG).
|
|
||
| def setup_network(self): | ||
| args = [ | ||
| '-checkmempool', |
There was a problem hiding this comment.
From Claude, by looking at Zebra code.
Zebra has an internal mempool queue checker (zebrad/src/components/mempool/queue_checker.rs) that runs automatically, but it is not a user-facing flag/option.
|
Note: Two tests are failing because of that. |
This test was never migrated to the Z3 stack. It passed a list of zcashd-style args to start_node, which now expects a ZebraArgs dataclass, so it crashed during setup_network with "'list' object has no attribute 'miner_address'". Because it inherits required: true from the Ubuntu 22.04 platform in CI, the failure blocked required runs. Move it out of FLAKY_SCRIPTS into a new, documented DISABLED_SCRIPTS list so it is excluded from the default run until it is migrated. A faithful migration is non-trivial: the test exercises the Blossom, Heartwood, Canopy and NU5 activation boundaries, but zebra regtest hardcodes everything up to Canopy to height 1, and it relies on -blockmaxsize to overflow the mempool, which the zebra regtest config does not expose. Also convert setup_network to pass a ZebraArgs dataclass built from the original activation heights, dropping the zcashd-only flags that have no ZebraArgs equivalent, so the argument-passing crash is fixed regardless.
mempool_packages.py and the four SERIAL_SCRIPTS (mergetoaddress_sapling, mergetoaddress_ua_nu5, mergetoaddress_ua_sapling, wallet_shieldingcoinbase) have the same root cause as mempool_nu_activation.py: they pass lists of zcashd-style args to start_node / start_nodes, which now expect a ZebraArgs dataclass, so they crash during setup_network with "'list' object has no attribute 'miner_address'". They inherit required: true from the platform in CI, so they block required runs. Several also depend on zcashd-only flags with no ZebraArgs equivalent (-regtestshieldcoinbase, -anchorconfirmations, -limitancestorcount), so they cannot be migrated by converting args alone. Move them all into DISABLED_SCRIPTS until they are migrated, and note their original SERIAL_SCRIPTS / FLAKY_SCRIPTS membership so they can be restored.
When zallet (or zebrad/zainod) fails during setup, for example a wallet
that cannot synchronize during cache rebuild ("Missing Orchard tree
state"), the run could hang until the CI job timed out (6 hours) instead
of failing. Two causes:
1. wait_for_bitcoind_start, wait_for_wallet_start and wait_for_zainod_start
looped forever if a process started but never became ready (for example
a port held by an orphaned process). They now give up after
PROCESS_STARTUP_TIMEOUT seconds (default 120, overridable via env) and
raise.
2. On the failure path, main()'s cleanup crashed on stop_wallets(None)
because self.wallets was still None, so the remaining shutdown steps
were skipped. Worse, the processes spawned during cache rebuild are
tracked in module globals (not self.*), so they were never stopped and
lingered, holding ports and the test's stdout pipe open, which is what
hung the job.
Cleanup is now defensive: each RPC stop is guarded and best-effort, and a
new stop_all_processes() helper force-terminates everything still running
in the global process dicts (SIGTERM first, then a bounded wait, then
SIGKILL). wait_bitcoinds and wait_zallets are likewise bounded so a node
that ignores the stop request cannot hang teardown.
The result is that a failed setup now stops and fails quickly rather than
hanging the CI job.
The shared 'current' cache is built by rebuild_cache(), which starts zebrad
and then syncs zallet against it. zebrad was started with only a miner
address and no activation heights, so it used its regtest defaults, which
activate Overwinter through Canopy at height 1 but leave NU5 disabled.
zallet, however, is configured (qa/defaults/zallet/zallet.toml) to activate
NU5 at height 1. During sync zallet asks zebrad for the treestate at the
anchor height, and because zallet believes NU5 is active it requires an
Orchard tree; zebrad, with NU5 inactive, returns none, so zallet fails with
InvalidData { message: "Missing Orchard tree state" }. This broke every test
that builds the cache.
Activate NU5 at height 1 in the zebrad config used for the cache, matching
zallet, via a documented REGTEST_CACHE_ACTIVATION_HEIGHTS constant. Add
cross-referencing comments in both util.py and zallet.toml so the two
configurations stay in sync.
Verified locally: a fresh cache rebuild now mines to height 200 and zallet
syncs without the "Missing Orchard tree state" error.
The cache-rebuild path was fixed to activate NU5 so zallet can sync, but
clean-cache tests that run a wallet through the default setup_nodes() (for
example wallet.py) hit the same mismatch: zebrad started without NU5 while
zallet's regtest config activates NU5 at height 1, so wallet sync failed
with "Missing Orchard tree state".
Generalise the fix: the default setup_nodes() now starts zebrad with
activation_heights={"NU5": 1} to match zallet, and the shared constant is
renamed from REGTEST_CACHE_ACTIVATION_HEIGHTS to REGTEST_ACTIVATION_HEIGHTS
since it is no longer cache-specific. Tests that need different activation
heights continue to override setup_nodes() themselves.
Verified locally: wallet.py now passes and decodescript.py (the other
wallet test on the default setup path) still passes.
With cache_behavior='current'/'fresh', initialize_chain.init_from_cache() copies each cached wallet dir (including wallet.db) into the test dir, but setup_network -> prepare_wallets() -> prepare_wallets_for_mining() then raised "Wallet N already exists, cannot prepare it for mining". This broke every default-setup test that uses the chain cache and a wallet (about 30 tests: converttex.py, getchaintips.py, wallet_accounts.py, etc.). rebuild_cache now records each wallet's mining address in a miner_address.txt inside the wallet dir, which init_from_cache carries into the test dir. prepare_wallets_for_mining reuses an already-present wallet by reading that address instead of re-creating the wallet, so the test mines to the same address the cached chain paid its coinbase to. Freshly-prepared wallets also write the file so they can be captured into a cache. cache_rebuild_required forces a rebuild of older caches that predate the recorded address. Verified locally: converttex.py now passes against a freshly built cache.
The RPC test suite is mid-migration from zcashd to the Z3 stack (zebrad +
zaino + zallet). Most BASE_SCRIPTS are not migrated yet and fail on the new
stack (legacy list args to start_nodes, unimplemented zebra RPCs such as
gettxoutsetinfo and RPC basic-auth, zcashd-only config/log files, and
behavioural assertion mismatches), so the required RPC shards were always
red.
Take the pragmatic approach the suite already documents ("while we are
getting the existing tests to pass, skip tests that are not expected to
pass"): record every currently-failing test in DISABLED_SCRIPTS and subtract
DISABLED_SCRIPTS from the lists that are actually run, matching on the script
filename. The required set is now the migrated, passing subset; re-enable
each test as it is migrated. ALL_SCRIPTS is left complete so a disabled test
can still be run explicitly by name.
Verified against CI run 27016257863: the 103 disabled entries are exactly
the tests that failed there, and the remaining enabled tests are the ones
that passed.
The node no longer has a built-in wallet on the Z3 stack (the wallet is the separate zallet process), so "running a node with -disablewallet" is just a node with no wallet. Drop the custom setup_network and the -disablewallet flag, set num_wallets = 0, and let the default setup_network start a single walletless node. run_test only calls validateaddress, which zebrad implements. Re-enable it in the test runner. Verified locally: disablewallet.py passes.
Ran these disabled tests against the Z3 stack and annotated each with the zcashd feature it still needs: - errors.py, getchaintips.py: zebra missing gettxoutsetinfo / getchaintips - nodehandling.py: zebra missing setban/listbanned/disconnectnode - feature_walletfile.py: zcashd wallet.dat/-wallet, not used by zallet - feature_logging.py: no zcashd-style regtest/debug.log on zebra Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
reorg_limit.py uses no wallet, but defaulted to num_wallets=4, so an unused wallet failed to start and masked the real test. Set num_wallets=0 and activate NU5 at height 1 to match the cache; setup now succeeds. The test still fails (kept disabled): after a network split where node 2 builds a longer chain, reconnecting does not make node 0 reorg to it, and zebra does not reproduce zcashd's deep-reorg safety shutdown. TODO noted in the disabled list. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
a56ba05 to
e13829d
Compare
Add a terse inline note to each DISABLED_SCRIPTS entry: the missing feature it needs (zallet or zebra RPC, pre-NU5 activation height, zcashd-only behaviour, or the P2P/mininode framework), or that it just needs a ZebraArgs migration. Classified from each test's RPC / nuparams / flag usage. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…test MISSING-FEATURES.md: source-verified inventory of what keeps each migrated zcashd RPC test disabled, grouped by component (zallet RPCs, zebra node RPCs, regtest activation heights, zcashd-only behaviour, P2P framework), plus the zcashd-deprecation -> zallet account/UA-API migration mapping. rpc-tests.py: each DISABLED_SCRIPTS entry now notes the suggested migration (e.g. getnewaddress -> z_getaddressforaccount) or the genuine gap, rather than just "missing". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Migrating wallet_unified_change.py to the Z3 stack surfaced that zallet z_sendmany cannot spend the harness-mined regtest coinbase: the wallet reports the coinbase as transparent.regular.spendable (~318 ZEC) but total.spendable is 0, and z_sendmany fails with "have 0". The mined coinbase must first be shielded with z_shieldcoinbase. This makes every coinbase-funded wallet test a restructure (add a z_shieldcoinbase funding step plus the account-model rewrite), not a simple RPC rename. Record this in MISSING-FEATURES.md along with the other zallet API specifics found (z_getnewaccount needs a name and returns account_uuid with a null ZIP 32 index, z_sendmany requires a null fee, default UA derivation hits the transparent gap limit), and update the disabled-reason comment for wallet_unified_change.py. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "Trial migration outcome" section to MISSING-FEATURES.md capturing the six structural steps a coinbase-funded wallet test needs on the Z3 stack (z_shieldcoinbase funding, account_uuid identity, z_getbalances reshape, null fee, explicit shielded receivers, self.wallets/miner_address wiring). The trial was stopped before the z_shieldcoinbase funding rewrite; wallet_unified_change.py stays disabled. The steps are the recipe for resuming the wallet-test migration. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This test was never migrated to the Z3 stack. It passed a list of zcashd-style args to start_node, which now expects a ZebraArgs dataclass, so it crashed during setup_network with "'list' object has no attribute 'miner_address'". Because it inherits required: true from the Ubuntu 22.04 platform in CI, the failure blocked required runs.
Move it out of FLAKY_SCRIPTS into a new, documented DISABLED_SCRIPTS list so it is excluded from the default run until it is migrated. A faithful migration is non-trivial: the test exercises the Blossom, Heartwood, Canopy and NU5 activation boundaries, but zebra regtest hardcodes everything up to Canopy to height 1, and it relies on -blockmaxsize to overflow the mempool, which the zebra regtest config does not expose.
Also convert setup_network to pass a ZebraArgs dataclass built from the original activation heights, dropping the zcashd-only flags that have no ZebraArgs equivalent, so the argument-passing crash is fixed regardless.