Skip to content

test: showcase contention during asset generation, analyse problem and solutions#3374

Open
gilcu3 wants to merge 2 commits into
mainfrom
1175-investigate-why-asset-generation-impacts-signing-performance
Open

test: showcase contention during asset generation, analyse problem and solutions#3374
gilcu3 wants to merge 2 commits into
mainfrom
1175-investigate-why-asset-generation-impacts-signing-performance

Conversation

@gilcu3
Copy link
Copy Markdown
Contributor

@gilcu3 gilcu3 commented May 27, 2026

Closes #1175

Notice here we don't deal with the indexer being part of the problem, which we can somehow safely assume as we only face this right after resharings

@gilcu3 gilcu3 linked an issue May 27, 2026 that may be closed by this pull request
@gilcu3 gilcu3 force-pushed the 1175-investigate-why-asset-generation-impacts-signing-performance branch 3 times, most recently from fca8b91 to a9f0274 Compare May 29, 2026 15:18
@gilcu3 gilcu3 force-pushed the 1175-investigate-why-asset-generation-impacts-signing-performance branch from a9f0274 to 75f0b62 Compare June 1, 2026 07:52
@gilcu3 gilcu3 force-pushed the 1175-investigate-why-asset-generation-impacts-signing-performance branch from 32b5346 to 4c050fe Compare June 4, 2026 17:22
@gilcu3 gilcu3 force-pushed the 1175-investigate-why-asset-generation-impacts-signing-performance branch from 4c050fe to 45b2987 Compare June 5, 2026 05:59
Comment on lines +93 to +96
/// Presignatures are buffered modestly and identically in both scenarios, so they
/// stay available for signing without themselves being the variable under test.
const PRESIGNATURE_CONCURRENCY: usize = 2;
const PRESIGNATURES_TO_BUFFER: usize = 64;
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this test focuses on the triples, but the presignatures might have some part in the problem, as they are also computed aggressively after resharing.

@gilcu3 gilcu3 changed the title test: showcase contention during asset generation test: showcase contention during asset generation, analyse problem and solutions Jun 5, 2026
@gilcu3 gilcu3 marked this pull request as ready for review June 5, 2026 06:09
@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

PR title type suggestion: This PR changes source code files (p2p.rs) in addition to adding tests and documentation, so the type prefix should probably be refactor: instead of test:.

Suggested title: refactor: showcase contention during asset generation, analyse problem and solutions

@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

Pull request overview

Adds a self-contained reproduction for issue #1175 ("asset generation impacts signing performance"): a 4-node in-process cluster that measures signing latency in two states — steady (small buffers, idle generation) vs. simulated post-resharing (mainnet-sized buffers, aggressive refill) — and asserts the second case degrades. Also adds a design doc enumerating possible solutions (lower-OS-priority gen runtime, bounded follower fan-out, yield_now(), operational knobs, etc.) without picking one. No production code touched; this is analysis + scaffolding for a follow-up fix PR.

Changes:

  • New tests/asset_generation_signing_contention.rs integration test (#[ignore], run manually) that compares signing p50/p90/max under two refill regimes.
  • New LatencyReport aggregator built on the average crate (Mean/Quantile/Max), mirroring the threshold-signatures bench utilities.
  • Adds average as a dev-dependency for mpc-node and bumps PortSeed::TOTAL_DEFINED_PORTS to allocate ASSET_GENERATION_SIGNING_CONTENTION_TEST = 22.
  • Adds docs/design/signing-starvation-solution.md cataloguing root-cause facts and candidate fixes A, A.1, B, C, plus operational and protocol-level options.

Reviewed changes

Per-file summary
File Description
Cargo.lock Pulls in average dev-dep transitively.
crates/node/Cargo.toml Adds average to [dev-dependencies].
crates/node/src/p2p.rs Adds ASSET_GENERATION_SIGNING_CONTENTION_TEST PortSeed; bumps TOTAL_DEFINED_PORTS 22 to 23.
crates/node/src/tests.rs Registers the new test module.
crates/node/src/tests/asset_generation_signing_contention.rs The reproduction: builds two clusters back-to-back, sends warmup + measured signatures, asserts steady.timeouts == 0 and post_resharing.timeouts > 0 || post_resharing.p50 >= steady.p50 * 2.
docs/design/signing-starvation-solution.md Root-cause writeup and solution-option catalogue.

Findings

Blocking (must fix before merge):

  • None.

Non-blocking (nits, follow-ups, suggestions):

  • crates/node/src/tests/asset_generation_signing_contention.rs:96 — gilcu3's open thread is still valid: the test holds PRESIGNATURE_CONCURRENCY and PRESIGNATURES_TO_BUFFER constant across both scenarios, so the "post-resharing" arm only exercises triple-side pressure. The design doc explicitly calls out that presignature pipelines also empty at resharing and that presignature.concurrency is itself a candidate knob (docs/design/signing-starvation-solution.md:95), so the reproduction under-models the very source it documents. Worth either (a) adding a third scenario that also raises presig pressure or (b) tightening the docstring to say "isolates the triple-refill component of Investigate why asset generation impacts signing performance #1175" rather than "post-resharing".
  • crates/node/src/tests/asset_generation_signing_contention.rs:175 — the label "post-resharing" is used throughout, but no resharing happens; the test simulates the resulting state by sizing buffers. The module docstring is honest about this, but the function-level naming (buffers_empty: bool, case: u16 with magic 0/1) buries it. A Scenario { Steady, ColdBuffers } enum would let the call sites read measure_signing_latency(Scenario::ColdBuffers) and remove the parallel case argument, which today couples two unrelated concerns (which knobs to use, which port slot to use).
  • crates/node/src/tests/asset_generation_signing_contention.rs:42 — the docstring papers over Result::unwrap() on an Err value: Closed panics from indexer/fake.rs during teardown as "pre-existing ... don't affect the result". If a future reader runs this test cold they'll reasonably treat those panics as a regression. Worth either filing a follow-up to fix the shutdown race or linking to an existing issue from the doc comment instead of just describing the symptom.
  • crates/node/src/tests/asset_generation_signing_contention.rs:296 — the disjunctive assertion (timeouts > 0 || p50 >= steady.p50 * 2) means a cluster that fails to bring up in the second arm (zero signatures complete, so all timeouts) is indistinguishable from a cluster that degrades under load. The steady-state harness check at line 292 only validates the first run; consider also asserting post_resharing.n_ok > 0 (or a minimum success count) so genuine setup failure does not masquerade as the bug being reproduced.
  • docs/design/signing-starvation-solution.md:36 — option labels skip from A/A.1/B/C straight to "Lower presignature.concurrency" / "Reduce SUPPORTED_TRIPLE_GENERATION_BATCH_SIZE" / "E", with no D. Minor, but if downstream PRs reference these by letter the numbering will be confusing — either rename the unlabelled ones D and F, or drop the lettering entirely and reference them by name.

✅ Approved

@claude
Copy link
Copy Markdown

claude Bot commented Jun 5, 2026

PR title type suggestion: This PR changes source code files (p2p.rs) along with tests and docs, so the type prefix should probably be fix: or perf: instead of test:. The test: type is for test-only changes.

Suggested title: perf: reduce contention during asset generation or fix: address signing starvation in asset generation

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice this would not be executed in CI. The plan is to have it so that once the fixes are applied we can check if the behavior changes

Copy link
Copy Markdown
Collaborator

@netrome netrome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for investigating and writing this up. I'm experiencing some issues running the test, but since it's ignored I don't consider it a hard blocker.

}

#[test_log::test(tokio::test(flavor = "multi_thread", worker_threads = 4))]
#[ignore = "timing-sensitive reproduction for #1175; run manually with --run-ignored"]
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh didn't know until now the MetaNameValueStr syntax was supported for ignore attributes https://doc.rust-lang.org/reference/attributes/testing.html#r-attributes.testing.ignore.reason

Not sure what the benefit over a normal comment is 🤔

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(I also didn't know this was called MetaNameValueStr until now when I looked it up 😅 )

Comment on lines +18 to +20
2. **CPU-bound, non-yielding poke loop.** `run_protocol` in `protocol.rs`
runs `protocol.poke()` until `Action::Wait`. A 64-batch triple gen burst
is tens-to-hundreds of ms between awaits.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this sounds exactly like the kind of work tokio::spawn_blocking is for. Alternatively we should consider inserting more await points.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this sounds exactly like the kind of work tokio::spawn_blocking is for.

Yes, that should at least fix the contention problem on the tokio runtime.

Are we considering adding waiting points inside the crypto implementation?

Comment on lines +21 to +24
3. **Unbounded follower fan-out.**
`mpc_client.rs::monitor_passive_channels_inner` spawns one task per
incoming peer channel with no cap, so a node has no way to bound how much
follower work peers can push onto it.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is per network peer? Since we typically have ~10 participants this doesn't sound like it should be any problem.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is, because each peer launches 2 batch triple computations (64 triples each batch) at the same time. So when we compute the load after resharing, it scales linearly with the number of participants.

Basically with 10 nodes, right after resharing, each node now will start concurrently computing 64102 triples, and 16 (presignature concurrency) * 10 presignatures (as soon as triples becomes available). That seems like a lot, but the important bit is that due to work started by peers, it scales linearly in each node.

Comment on lines +30 to +33
5. **Mainnet runs two CaitSith domains.** Presignature generation runs
per-domain (one background loop per `(provider, domain)` in
`spawn_background_tasks`), so the per-node presig pipeline doubles.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, but this problem was observed before we ran two CaitSith domains.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, that just maybe made it a bit worse, although as they share triples, and we also decreased the # triples from 1M to 16K, probably we never really observed the diff.

Comment on lines +115 to +118
### E — `spawn_blocking` or dedicated rayon pool

Move CPU-bound compute off async worker threads entirely. Overlaps with what
A achieves through runtime separation, but at a deeper restructuring layer.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this sounds like something we should have added already. Should be a no-brainer for any CPU kind of work.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The one things I am not 100% sure of is if triple generation is just CPU bound, or the network has also a big influence.

Comment on lines +85 to +94
### C — `yield_now()` in the poke loop

A single `tokio::task::yield_now().await` in `run_protocol`'s outer loop
shortens the maximum CPU burst between cooperative yield points
(a comment in `protocol.rs` already documents the hazard).

With A in place this mainly helps fairness *within* `gen_runtime`; gen tasks
yielding to each other does not free a core for signing. Cheap, harmless,
defensible independently of #1175.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah definitely worth exploring, unless we go with E at which point this shouldn't matter as much.


## Solution options

### A — Lower-OS-priority gen runtime
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would much prefer not to add additional runtimes, but if not E and possibly C helps with this we should definitely consider something like this.

#[test_log::test(tokio::test(flavor = "multi_thread", worker_threads = 4))]
#[ignore = "timing-sensitive reproduction for #1175; run manually with --run-ignored"]
#[expect(non_snake_case)]
async fn signing_latency__should_degrade_under_concurrent_asset_generation() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I tried to run this on my machine now and I'm getting this error:


    thread 'tests::asset_generation_signing_contention::signing_latency__should_degrade_under_concurrent_asset_generation' (419399) panicked at crates/node/src/tests/asset_generation_signing_contention.rs:292:5:
    assertion `left == right` failed: steady-state baseline should not time out; harness is unhealthy

anything you recognize?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we discussed in the thread, probably due to the use of test-release mode

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Investigate why asset generation impacts signing performance

2 participants