Skip to content

DO NOT REVIEW - Fix CI#104

Draft
dannywillems wants to merge 15 commits into
mainfrom
fix/disable-mempool-nu-activation
Draft

DO NOT REVIEW - Fix CI#104
dannywillems wants to merge 15 commits into
mainfrom
fix/disable-mempool-nu-activation

Conversation

@dannywillems

Copy link
Copy Markdown
Contributor

This test was never migrated to the Z3 stack. It passed a list of zcashd-style args to start_node, which now expects a ZebraArgs dataclass, so it crashed during setup_network with "'list' object has no attribute 'miner_address'". Because it inherits required: true from the Ubuntu 22.04 platform in CI, the failure blocked required runs.

Move it out of FLAKY_SCRIPTS into a new, documented DISABLED_SCRIPTS list so it is excluded from the default run until it is migrated. A faithful migration is non-trivial: the test exercises the Blossom, Heartwood, Canopy and NU5 activation boundaries, but zebra regtest hardcodes everything up to Canopy to height 1, and it relies on -blockmaxsize to overflow the mempool, which the zebra regtest config does not expose.

Also convert setup_network to pass a ZebraArgs dataclass built from the original activation heights, dropping the zcashd-only flags that have no ZebraArgs equivalent, so the argument-passing crash is fixed regardless.

@dannywillems dannywillems changed the title Disable mempool_nu_activation.py and convert its args to ZebraArgs Fix CI Jun 5, 2026
oxarbitrage added a commit that referenced this pull request Jun 5, 2026
Companion to the bounded shutdown waits: the wait_for_{bitcoind,wallet,zainod}_start
loops were unbounded `while True`, so a process that starts but never finishes
initialization (a stuck sync, or a port still held by an orphan from a prior
aborted run) would spin forever and hang the run until CI's multi-hour limit —
the same failure mode the shutdown bound fixes, on the startup side.

Add a PROC_START_TIMEOUT (default 120s, env-overridable) deadline to all three
startup loops so they fail fast and surface the cause instead of hanging.

Approach grafted from #104 (dannywillems), consolidated here so #102 is the
single home for the process-lifecycle bounding.

Co-Authored-By: Danny Willems <be.danny.willems@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@oxarbitrage

Copy link
Copy Markdown
Contributor

Thanks for digging into this @dannywillems — the process-lifecycle hardening here is genuinely valuable, and the per-test analysis of why each test can't run is excellent. I validated things end-to-end; below is the full picture and a suggested way to split the work, since this PR overlaps a couple of others. Treat the breakdown as a proposal — happy to adjust if you see it differently.

The CI red looks like an upstream blocker, not anything in this diff. The required shards fail in create_cache.pyrebuild_cache() with:

Failed to synchronize zallet: InvalidData { message: "Missing Orchard tree state" }

That's zcash/wallet#454 (fixed by zcash/wallet#455, not yet merged). CI builds stock zallet, so the cache rebuild crashes before any test in the list runs — so disabling tests can't get the rollup green on its own. Nothing on the integration-tests side goes green until #455 lands.

This PR is effectively three changes, and two overlap existing PRs, so there's a clean opportunity to split it:

  1. Process-lifecycle hardening (PROCESS_STARTUP_TIMEOUT, bounded waits, stop_all_processes) overlaps test framework: bound process-shutdown waits to avoid multi-hour hangs #102 — and it's the most valuable piece: it turns the #454 crash from a 6-hour hang into a fast failure (ci: make SERIAL/FLAKY RPC shards non-required #101's run hit the 6h ceiling vs. ~5s with the bound). Your startup-wait timeout was the bit test framework: bound process-shutdown waits to avoid multi-hour hangs #102 was missing, so I've grafted that idea into test framework: bound process-shutdown waits to avoid multi-hour hangs #102 and credited you as co-author on that commit. Suggest dropping part 3 here and letting test framework: bound process-shutdown waits to avoid multi-hour hangs #102 carry it.
  2. Disabling the broken tests (DISABLED_SCRIPTS) overlaps ci: make SERIAL/FLAKY RPC shards non-required #101, but they layer nicely: ci: make SERIAL/FLAKY RPC shards non-required #101 fixes the underlying matrix bug (the required flag is stamped per-platform instead of per-shard) and is the home for genuinely-flaky quarantines, while DISABLED_SCRIPTS is the right tool for the six deterministically-broken (not flaky) tests. Suggest rebasing the disable on top of ci: make SERIAL/FLAKY RPC shards non-required #101's per-shard mechanism rather than emptying SERIAL_SCRIPTS/FLAKY_SCRIPTS wholesale.
  3. ZebraArgs migration of mempool_nu_activation.py — 👍 great as-is.

How it fits together: #102 (hardening) + #101 (matrix mechanism) + this PR (the DISABLED_SCRIPTS taxonomy + ZebraArgs migration), coordinated rather than competing — #102 landing first so the others fail fast instead of hanging, and everything able to go green once #455 merges.

One small thing: the "Fix CI" title undersells the work — maybe "test framework: disable unmigrated RPC tests + migrate mempool_nu_activation to ZebraArgs" once part 3 is split out.

@dannywillems dannywillems marked this pull request as draft June 5, 2026 12:42
@oxarbitrage

Copy link
Copy Markdown
Contributor

Correction to my comment above — I'd reviewed an earlier push and missed your three latest commits (the NU5-activation work). They change the picture:

Retracting the "nothing goes green until zcash/wallet#455" claim. Your REGTEST_ACTIVATION_HEIGHTS = {"NU5": 1} is the better root-cause: the Missing Orchard tree state crash came from zebrad starting without NU5 while zallet.toml activates NU5 at height 1 — they disagreed, so zebra returned no Orchard treestate. Aligning them makes the cache rebuild succeed without the upstream #455 fix. Nice catch — that was a harness misconfiguration we'd been reading as a pure upstream blocker.

The remaining 0/13 is a config asymmetry the NU5 change introduced — and it's a one-line-ish fix. The cache is always built with NU5 (initialize_chain/rebuild_cache), but setup_nodes() only passes NU5 when miner_addresses is set, which only happens for num_wallets > 0. So every non-wallet test starts zebra with args = None (no NU5) and then loads the NU5-built cache → the cached chain's history root doesn't validate:

failed to commit non-finalized block from backup directory
  commit_error=InvalidBlockCommitment(InvalidChainHistoryRoot { ... }) height=Height(97)

Reproduced locally: a clean-cache httpbasics.py run fails with exactly this, while the cache itself builds fine (no Missing Orchard tree state), confirming it's the cache-vs-node config mismatch rather than anything upstream. That's why even httpbasics.py / nodehandling.py fail. Fix is to apply REGTEST_ACTIVATION_HEIGHTS to all nodes in setup_nodes(), not just when miner addresses exist — e.g.:

def setup_nodes(self):
    args = [
        ZebraArgs(
            miner_address=(self.miner_addresses[i] if self.miner_addresses else None),
            activation_heights=dict(REGTEST_ACTIVATION_HEIGHTS),
        )
        for i in range(self.num_nodes)
    ]
    return start_nodes(self.num_nodes, self.options.tmpdir, args)

(Any test overriding setup_network/setup_nodes with its own args needs the same activation heights, or its nodes won't match the shared cache.)

Minor: your bounded startup waits (PROCESS_STARTUP_TIMEOUT) and the ones I just added to #102 are the same change arrived at independently, so that part already lines up when we consolidate into #102.

The earlier layering suggestion still stands (disable on #101's per-shard mechanism; #102 as the home for the hardening), but your NU5 fix is the headline and should land regardless.

@dannywillems

Copy link
Copy Markdown
Contributor Author

This is a WIP, no need to review for now.

'-debug=mempool',
'-blockmaxsize=4000',
'-preferredtxversion=4',
'-allowdeprecated=getnewaddress',

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do believe it is expected to not be implemented in Zebra, as Zebra is a consensus node, and not a wallet.
Same argument for:

-allowdeprecated=z_getbalance
-allowdeprecated=z_getnewaddress

'-checkmempool',
'-debug=mempool',
'-blockmaxsize=4000',
'-preferredtxversion=4',

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Claude, by looking at Zebra code.

No config or flag for preferred tx version. Tx versions follow the active network upgrade / protocol rules.

args = [
'-checkmempool',
'-debug=mempool',
'-blockmaxsize=4000',

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Claude, by looking at Zebra code.

Max block size is the hard-coded consensus constant MAX_BLOCK_BYTES = 2_000_000 (zebra-chain/src/block/serialize.rs:24). Not configurable.

def setup_network(self):
args = [
'-checkmempool',
'-debug=mempool',

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Claude, by looking at Zebra code.

Zebra uses tracing filters, not -debug= flags. Equivalent: set a tracing filter like zebra_mempool=debug (via the [tracing] config or the --filters arg / RUST_LOG).


def setup_network(self):
args = [
'-checkmempool',

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From Claude, by looking at Zebra code.

Zebra has an internal mempool queue checker (zebrad/src/components/mempool/queue_checker.rs) that runs automatically, but it is not a user-facing flag/option.

@dannywillems

Copy link
Copy Markdown
Contributor Author

Note:

auth-url-none (zebra RPC has no basic-auth)

Two tests are failing because of that.

@dannywillems dannywillems changed the title Fix CI DO NOT REVIEW - Fix CI Jun 5, 2026
@dannywillems dannywillems marked this pull request as ready for review June 5, 2026 15:22
dannywillems and others added 11 commits June 6, 2026 12:17
This test was never migrated to the Z3 stack. It passed a list of
zcashd-style args to start_node, which now expects a ZebraArgs dataclass,
so it crashed during setup_network with "'list' object has no attribute
'miner_address'". Because it inherits required: true from the Ubuntu
22.04 platform in CI, the failure blocked required runs.

Move it out of FLAKY_SCRIPTS into a new, documented DISABLED_SCRIPTS list
so it is excluded from the default run until it is migrated. A faithful
migration is non-trivial: the test exercises the Blossom, Heartwood,
Canopy and NU5 activation boundaries, but zebra regtest hardcodes
everything up to Canopy to height 1, and it relies on -blockmaxsize to
overflow the mempool, which the zebra regtest config does not expose.

Also convert setup_network to pass a ZebraArgs dataclass built from the
original activation heights, dropping the zcashd-only flags that have no
ZebraArgs equivalent, so the argument-passing crash is fixed regardless.
mempool_packages.py and the four SERIAL_SCRIPTS (mergetoaddress_sapling,
mergetoaddress_ua_nu5, mergetoaddress_ua_sapling, wallet_shieldingcoinbase)
have the same root cause as mempool_nu_activation.py: they pass lists of
zcashd-style args to start_node / start_nodes, which now expect a ZebraArgs
dataclass, so they crash during setup_network with "'list' object has no
attribute 'miner_address'". They inherit required: true from the platform in
CI, so they block required runs.

Several also depend on zcashd-only flags with no ZebraArgs equivalent
(-regtestshieldcoinbase, -anchorconfirmations, -limitancestorcount), so they
cannot be migrated by converting args alone.

Move them all into DISABLED_SCRIPTS until they are migrated, and note their
original SERIAL_SCRIPTS / FLAKY_SCRIPTS membership so they can be restored.
When zallet (or zebrad/zainod) fails during setup, for example a wallet
that cannot synchronize during cache rebuild ("Missing Orchard tree
state"), the run could hang until the CI job timed out (6 hours) instead
of failing. Two causes:

1. wait_for_bitcoind_start, wait_for_wallet_start and wait_for_zainod_start
   looped forever if a process started but never became ready (for example
   a port held by an orphaned process). They now give up after
   PROCESS_STARTUP_TIMEOUT seconds (default 120, overridable via env) and
   raise.

2. On the failure path, main()'s cleanup crashed on stop_wallets(None)
   because self.wallets was still None, so the remaining shutdown steps
   were skipped. Worse, the processes spawned during cache rebuild are
   tracked in module globals (not self.*), so they were never stopped and
   lingered, holding ports and the test's stdout pipe open, which is what
   hung the job.

Cleanup is now defensive: each RPC stop is guarded and best-effort, and a
new stop_all_processes() helper force-terminates everything still running
in the global process dicts (SIGTERM first, then a bounded wait, then
SIGKILL). wait_bitcoinds and wait_zallets are likewise bounded so a node
that ignores the stop request cannot hang teardown.

The result is that a failed setup now stops and fails quickly rather than
hanging the CI job.
The shared 'current' cache is built by rebuild_cache(), which starts zebrad
and then syncs zallet against it. zebrad was started with only a miner
address and no activation heights, so it used its regtest defaults, which
activate Overwinter through Canopy at height 1 but leave NU5 disabled.
zallet, however, is configured (qa/defaults/zallet/zallet.toml) to activate
NU5 at height 1. During sync zallet asks zebrad for the treestate at the
anchor height, and because zallet believes NU5 is active it requires an
Orchard tree; zebrad, with NU5 inactive, returns none, so zallet fails with
InvalidData { message: "Missing Orchard tree state" }. This broke every test
that builds the cache.

Activate NU5 at height 1 in the zebrad config used for the cache, matching
zallet, via a documented REGTEST_CACHE_ACTIVATION_HEIGHTS constant. Add
cross-referencing comments in both util.py and zallet.toml so the two
configurations stay in sync.

Verified locally: a fresh cache rebuild now mines to height 200 and zallet
syncs without the "Missing Orchard tree state" error.
The cache-rebuild path was fixed to activate NU5 so zallet can sync, but
clean-cache tests that run a wallet through the default setup_nodes() (for
example wallet.py) hit the same mismatch: zebrad started without NU5 while
zallet's regtest config activates NU5 at height 1, so wallet sync failed
with "Missing Orchard tree state".

Generalise the fix: the default setup_nodes() now starts zebrad with
activation_heights={"NU5": 1} to match zallet, and the shared constant is
renamed from REGTEST_CACHE_ACTIVATION_HEIGHTS to REGTEST_ACTIVATION_HEIGHTS
since it is no longer cache-specific. Tests that need different activation
heights continue to override setup_nodes() themselves.

Verified locally: wallet.py now passes and decodescript.py (the other
wallet test on the default setup path) still passes.
With cache_behavior='current'/'fresh', initialize_chain.init_from_cache()
copies each cached wallet dir (including wallet.db) into the test dir, but
setup_network -> prepare_wallets() -> prepare_wallets_for_mining() then
raised "Wallet N already exists, cannot prepare it for mining". This broke
every default-setup test that uses the chain cache and a wallet (about 30
tests: converttex.py, getchaintips.py, wallet_accounts.py, etc.).

rebuild_cache now records each wallet's mining address in a miner_address.txt
inside the wallet dir, which init_from_cache carries into the test dir.
prepare_wallets_for_mining reuses an already-present wallet by reading that
address instead of re-creating the wallet, so the test mines to the same
address the cached chain paid its coinbase to. Freshly-prepared wallets also
write the file so they can be captured into a cache. cache_rebuild_required
forces a rebuild of older caches that predate the recorded address.

Verified locally: converttex.py now passes against a freshly built cache.
The RPC test suite is mid-migration from zcashd to the Z3 stack (zebrad +
zaino + zallet). Most BASE_SCRIPTS are not migrated yet and fail on the new
stack (legacy list args to start_nodes, unimplemented zebra RPCs such as
gettxoutsetinfo and RPC basic-auth, zcashd-only config/log files, and
behavioural assertion mismatches), so the required RPC shards were always
red.

Take the pragmatic approach the suite already documents ("while we are
getting the existing tests to pass, skip tests that are not expected to
pass"): record every currently-failing test in DISABLED_SCRIPTS and subtract
DISABLED_SCRIPTS from the lists that are actually run, matching on the script
filename. The required set is now the migrated, passing subset; re-enable
each test as it is migrated. ALL_SCRIPTS is left complete so a disabled test
can still be run explicitly by name.

Verified against CI run 27016257863: the 103 disabled entries are exactly
the tests that failed there, and the remaining enabled tests are the ones
that passed.
The node no longer has a built-in wallet on the Z3 stack (the wallet is the
separate zallet process), so "running a node with -disablewallet" is just a
node with no wallet. Drop the custom setup_network and the -disablewallet
flag, set num_wallets = 0, and let the default setup_network start a single
walletless node. run_test only calls validateaddress, which zebrad
implements. Re-enable it in the test runner.

Verified locally: disablewallet.py passes.
Ran these disabled tests against the Z3 stack and annotated each with the
zcashd feature it still needs:

- errors.py, getchaintips.py: zebra missing gettxoutsetinfo / getchaintips
- nodehandling.py: zebra missing setban/listbanned/disconnectnode
- feature_walletfile.py: zcashd wallet.dat/-wallet, not used by zallet
- feature_logging.py: no zcashd-style regtest/debug.log on zebra

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
reorg_limit.py uses no wallet, but defaulted to num_wallets=4, so an
unused wallet failed to start and masked the real test. Set num_wallets=0
and activate NU5 at height 1 to match the cache; setup now succeeds.

The test still fails (kept disabled): after a network split where node 2
builds a longer chain, reconnecting does not make node 0 reorg to it, and
zebra does not reproduce zcashd's deep-reorg safety shutdown. TODO noted in
the disabled list.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dannywillems dannywillems force-pushed the fix/disable-mempool-nu-activation branch from a56ba05 to e13829d Compare June 6, 2026 10:22
dannywillems and others added 2 commits June 6, 2026 12:49
Add a terse inline note to each DISABLED_SCRIPTS entry: the missing feature
it needs (zallet or zebra RPC, pre-NU5 activation height, zcashd-only
behaviour, or the P2P/mininode framework), or that it just needs a ZebraArgs
migration. Classified from each test's RPC / nuparams / flag usage.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…test

MISSING-FEATURES.md: source-verified inventory of what keeps each migrated
zcashd RPC test disabled, grouped by component (zallet RPCs, zebra node
RPCs, regtest activation heights, zcashd-only behaviour, P2P framework),
plus the zcashd-deprecation -> zallet account/UA-API migration mapping.

rpc-tests.py: each DISABLED_SCRIPTS entry now notes the suggested migration
(e.g. getnewaddress -> z_getaddressforaccount) or the genuine gap, rather
than just "missing".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
dannywillems and others added 2 commits June 6, 2026 15:06
Migrating wallet_unified_change.py to the Z3 stack surfaced that zallet
z_sendmany cannot spend the harness-mined regtest coinbase: the wallet
reports the coinbase as transparent.regular.spendable (~318 ZEC) but
total.spendable is 0, and z_sendmany fails with "have 0". The mined
coinbase must first be shielded with z_shieldcoinbase.

This makes every coinbase-funded wallet test a restructure (add a
z_shieldcoinbase funding step plus the account-model rewrite), not a
simple RPC rename. Record this in MISSING-FEATURES.md along with the
other zallet API specifics found (z_getnewaccount needs a name and
returns account_uuid with a null ZIP 32 index, z_sendmany requires a
null fee, default UA derivation hits the transparent gap limit), and
update the disabled-reason comment for wallet_unified_change.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a "Trial migration outcome" section to MISSING-FEATURES.md capturing
the six structural steps a coinbase-funded wallet test needs on the Z3
stack (z_shieldcoinbase funding, account_uuid identity, z_getbalances
reshape, null fee, explicit shielded receivers, self.wallets/miner_address
wiring). The trial was stopped before the z_shieldcoinbase funding
rewrite; wallet_unified_change.py stays disabled. The steps are the recipe
for resuming the wallet-test migration.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants