test: showcase contention during asset generation, analyse problem and solutions by gilcu3 · Pull Request #3374 · near/mpc

gilcu3 · 2026-05-27T11:12:41Z

Notice here we don't deal with the indexer being part of the problem, which we can somehow safely assume as we only face this right after resharings

gilcu3 · 2026-06-05T06:06:57Z

+/// Presignatures are buffered modestly and identically in both scenarios, so they
+/// stay available for signing without themselves being the variable under test.
+const PRESIGNATURE_CONCURRENCY: usize = 2;
+const PRESIGNATURES_TO_BUFFER: usize = 64;


this test focuses on the triples, but the presignatures might have some part in the problem, as they are also computed aggressively after resharing.

claude · 2026-06-05T06:10:41Z

PR title type suggestion: This PR changes source code files (p2p.rs) in addition to adding tests and documentation, so the type prefix should probably be refactor: instead of test:.

Suggested title: refactor: showcase contention during asset generation, analyse problem and solutions

claude · 2026-06-05T06:14:14Z

Pull request overview

Adds a self-contained reproduction for issue #1175 ("asset generation impacts signing performance"): a 4-node in-process cluster that measures signing latency in two states — steady (small buffers, idle generation) vs. simulated post-resharing (mainnet-sized buffers, aggressive refill) — and asserts the second case degrades. Also adds a design doc enumerating possible solutions (lower-OS-priority gen runtime, bounded follower fan-out, yield_now(), operational knobs, etc.) without picking one. No production code touched; this is analysis + scaffolding for a follow-up fix PR.

Changes:

New tests/asset_generation_signing_contention.rs integration test (#[ignore], run manually) that compares signing p50/p90/max under two refill regimes.
New LatencyReport aggregator built on the average crate (Mean/Quantile/Max), mirroring the threshold-signatures bench utilities.
Adds average as a dev-dependency for mpc-node and bumps PortSeed::TOTAL_DEFINED_PORTS to allocate ASSET_GENERATION_SIGNING_CONTENTION_TEST = 22.
Adds docs/design/signing-starvation-solution.md cataloguing root-cause facts and candidate fixes A, A.1, B, C, plus operational and protocol-level options.

Reviewed changes

Per-file summary

File	Description
Cargo.lock	Pulls in `average` dev-dep transitively.
crates/node/Cargo.toml	Adds `average` to `[dev-dependencies]`.
crates/node/src/p2p.rs	Adds `ASSET_GENERATION_SIGNING_CONTENTION_TEST` PortSeed; bumps `TOTAL_DEFINED_PORTS` 22 to 23.
crates/node/src/tests.rs	Registers the new test module.
crates/node/src/tests/asset_generation_signing_contention.rs	The reproduction: builds two clusters back-to-back, sends warmup + measured signatures, asserts `steady.timeouts == 0` and `post_resharing.timeouts > 0 \|\| post_resharing.p50 >= steady.p50 * 2`.
docs/design/signing-starvation-solution.md	Root-cause writeup and solution-option catalogue.

Findings

Blocking (must fix before merge):

None.

Non-blocking (nits, follow-ups, suggestions):

crates/node/src/tests/asset_generation_signing_contention.rs:96 — gilcu3's open thread is still valid: the test holds PRESIGNATURE_CONCURRENCY and PRESIGNATURES_TO_BUFFER constant across both scenarios, so the "post-resharing" arm only exercises triple-side pressure. The design doc explicitly calls out that presignature pipelines also empty at resharing and that presignature.concurrency is itself a candidate knob (docs/design/signing-starvation-solution.md:95), so the reproduction under-models the very source it documents. Worth either (a) adding a third scenario that also raises presig pressure or (b) tightening the docstring to say "isolates the triple-refill component of Investigate why asset generation impacts signing performance #1175" rather than "post-resharing".
crates/node/src/tests/asset_generation_signing_contention.rs:175 — the label "post-resharing" is used throughout, but no resharing happens; the test simulates the resulting state by sizing buffers. The module docstring is honest about this, but the function-level naming (buffers_empty: bool, case: u16 with magic 0/1) buries it. A Scenario { Steady, ColdBuffers } enum would let the call sites read measure_signing_latency(Scenario::ColdBuffers) and remove the parallel case argument, which today couples two unrelated concerns (which knobs to use, which port slot to use).
crates/node/src/tests/asset_generation_signing_contention.rs:42 — the docstring papers over Result::unwrap() on an Err value: Closed panics from indexer/fake.rs during teardown as "pre-existing ... don't affect the result". If a future reader runs this test cold they'll reasonably treat those panics as a regression. Worth either filing a follow-up to fix the shutdown race or linking to an existing issue from the doc comment instead of just describing the symptom.
crates/node/src/tests/asset_generation_signing_contention.rs:296 — the disjunctive assertion (timeouts > 0 || p50 >= steady.p50 * 2) means a cluster that fails to bring up in the second arm (zero signatures complete, so all timeouts) is indistinguishable from a cluster that degrades under load. The steady-state harness check at line 292 only validates the first run; consider also asserting post_resharing.n_ok > 0 (or a minimum success count) so genuine setup failure does not masquerade as the bug being reproduced.
docs/design/signing-starvation-solution.md:36 — option labels skip from A/A.1/B/C straight to "Lower presignature.concurrency" / "Reduce SUPPORTED_TRIPLE_GENERATION_BATCH_SIZE" / "E", with no D. Minor, but if downstream PRs reference these by letter the numbering will be confusing — either rename the unlabelled ones D and F, or drop the lettering entirely and reference them by name.

✅ Approved

claude · 2026-06-05T06:14:21Z

PR title type suggestion: This PR changes source code files (p2p.rs) along with tests and docs, so the type prefix should probably be fix: or perf: instead of test:. The test: type is for test-only changes.

Suggested title: perf: reduce contention during asset generation or fix: address signing starvation in asset generation

gilcu3 · 2026-06-05T06:16:35Z

Notice this would not be executed in CI. The plan is to have it so that once the fixes are applied we can check if the behavior changes

netrome

Thanks for investigating and writing this up. I'm experiencing some issues running the test, but since it's ignored I don't consider it a hard blocker.

netrome · 2026-06-05T11:15:58Z

+}
+
+#[test_log::test(tokio::test(flavor = "multi_thread", worker_threads = 4))]
+#[ignore = "timing-sensitive reproduction for #1175; run manually with --run-ignored"]


Oh didn't know until now the MetaNameValueStr syntax was supported for ignore attributes https://doc.rust-lang.org/reference/attributes/testing.html#r-attributes.testing.ignore.reason

Not sure what the benefit over a normal comment is 🤔

(I also didn't know this was called MetaNameValueStr until now when I looked it up 😅 )

netrome · 2026-06-05T11:40:08Z

+2. **CPU-bound, non-yielding poke loop.** `run_protocol` in `protocol.rs`
+   runs `protocol.poke()` until `Action::Wait`. A 64-batch triple gen burst
+   is tens-to-hundreds of ms between awaits.


Oh this sounds exactly like the kind of work tokio::spawn_blocking is for. Alternatively we should consider inserting more await points.

Oh this sounds exactly like the kind of work tokio::spawn_blocking is for.

Yes, that should at least fix the contention problem on the tokio runtime.

Are we considering adding waiting points inside the crypto implementation?

netrome · 2026-06-05T11:42:12Z

+3. **Unbounded follower fan-out.**
+   `mpc_client.rs::monitor_passive_channels_inner` spawns one task per
+   incoming peer channel with no cap, so a node has no way to bound how much
+   follower work peers can push onto it.


This is per network peer? Since we typically have ~10 participants this doesn't sound like it should be any problem.

I think it is, because each peer launches 2 batch triple computations (64 triples each batch) at the same time. So when we compute the load after resharing, it scales linearly with the number of participants.

Basically with 10 nodes, right after resharing, each node now will start concurrently computing 64102 triples, and 16 (presignature concurrency) * 10 presignatures (as soon as triples becomes available). That seems like a lot, but the important bit is that due to work started by peers, it scales linearly in each node.

netrome · 2026-06-05T11:43:06Z

+5. **Mainnet runs two CaitSith domains.** Presignature generation runs
+   per-domain (one background loop per `(provider, domain)` in
+   `spawn_background_tasks`), so the per-node presig pipeline doubles.
+


Sure, but this problem was observed before we ran two CaitSith domains.

Right, that just maybe made it a bit worse, although as they share triples, and we also decreased the # triples from 1M to 16K, probably we never really observed the diff.

netrome · 2026-06-05T11:44:23Z

+### E — `spawn_blocking` or dedicated rayon pool
+
+Move CPU-bound compute off async worker threads entirely. Overlaps with what
+A achieves through runtime separation, but at a deeper restructuring layer.


Yes, this sounds like something we should have added already. Should be a no-brainer for any CPU kind of work.

The one things I am not 100% sure of is if triple generation is just CPU bound, or the network has also a big influence.

netrome · 2026-06-05T11:44:47Z

+### C — `yield_now()` in the poke loop
+
+A single `tokio::task::yield_now().await` in `run_protocol`'s outer loop
+shortens the maximum CPU burst between cooperative yield points
+(a comment in `protocol.rs` already documents the hazard).
+
+With A in place this mainly helps fairness *within* `gen_runtime`; gen tasks
+yielding to each other does not free a core for signing. Cheap, harmless,
+defensible independently of #1175.
+


Yeah definitely worth exploring, unless we go with E at which point this shouldn't matter as much.

netrome · 2026-06-05T11:45:19Z

+
+## Solution options
+
+### A — Lower-OS-priority gen runtime


I would much prefer not to add additional runtimes, but if not E and possibly C helps with this we should definitely consider something like this.

netrome · 2026-06-05T11:46:12Z

+#[test_log::test(tokio::test(flavor = "multi_thread", worker_threads = 4))]
+#[ignore = "timing-sensitive reproduction for #1175; run manually with --run-ignored"]
+#[expect(non_snake_case)]
+async fn signing_latency__should_degrade_under_concurrent_asset_generation() {


Hmm I tried to run this on my machine now and I'm getting this error:

thread 'tests::asset_generation_signing_contention::signing_latency__should_degrade_under_concurrent_asset_generation' (419399) panicked at crates/node/src/tests/asset_generation_signing_contention.rs:292:5: assertion `left == right` failed: steady-state baseline should not time out; harness is unhealthy

anything you recognize?

we discussed in the thread, probably due to the use of test-release mode

gilcu3 linked an issue May 27, 2026 that may be closed by this pull request

Investigate why asset generation impacts signing performance #1175

Open

gilcu3 force-pushed the 1175-investigate-why-asset-generation-impacts-signing-performance branch 3 times, most recently from fca8b91 to a9f0274 Compare May 29, 2026 15:18

gilcu3 mentioned this pull request Jun 1, 2026

Investigate why asset generation impacts signing performance #1175

Open

gilcu3 force-pushed the 1175-investigate-why-asset-generation-impacts-signing-performance branch from a9f0274 to 75f0b62 Compare June 1, 2026 07:52

test: showcase contention during asset generation

7b4a5e0

gilcu3 force-pushed the 1175-investigate-why-asset-generation-impacts-signing-performance branch from 32b5346 to 4c050fe Compare June 4, 2026 17:22

docs: add signing starvation problem analysis and some solutions

45b2987

gilcu3 force-pushed the 1175-investigate-why-asset-generation-impacts-signing-performance branch from 4c050fe to 45b2987 Compare June 5, 2026 05:59

gilcu3 commented Jun 5, 2026

View reviewed changes

gilcu3 changed the title ~~test: showcase contention during asset generation~~ test: showcase contention during asset generation, analyse problem and solutions Jun 5, 2026

gilcu3 marked this pull request as ready for review June 5, 2026 06:09

gilcu3 commented Jun 5, 2026

View reviewed changes

netrome approved these changes Jun 5, 2026

View reviewed changes

Conversation

gilcu3 commented May 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot commented Jun 5, 2026

Uh oh!

claude Bot commented Jun 5, 2026

Pull request overview

Reviewed changes

Findings

Uh oh!

claude Bot commented Jun 5, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

netrome left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

gilcu3 commented May 27, 2026 •

edited

Loading