test: showcase contention during asset generation, analyse problem and solutions#3374
gilcu3 wants to merge 2 commits
| /// Presignatures are buffered modestly and identically in both scenarios, so they | ||
| /// stay available for signing without themselves being the variable under test. | ||
| const PRESIGNATURE_CONCURRENCY: usize = 2; | ||
| const PRESIGNATURES_TO_BUFFER: usize = 64; |
This test focuses on the triples, but the presignatures might also contribute to the problem, as they are also computed aggressively after resharing.
PR title type suggestion: This PR changes source code files (p2p.rs) in addition to adding tests and documentation, so the type prefix should probably be Suggested title:
Pull request overview
Adds a self-contained reproduction for issue #1175 ("asset generation impacts signing performance"): a 4-node in-process cluster that measures signing latency in two states, steady (small buffers, idle generation) vs. simulated post-resharing (mainnet-sized buffers, aggressive refill), and asserts that the second case degrades. Also adds a design doc enumerating possible solutions (lower-OS-priority gen runtime, bounded follower fan-out, …).
Changes:
Reviewed changes
Per-file summary
Findings
Blocking (must fix before merge):
Non-blocking (nits, follow-ups, suggestions):
✅ Approved
Note that this test would not be executed in CI. The plan is to keep it so that, once the fixes are applied, we can check whether the behavior changes.
netrome left a comment
Thanks for investigating and writing this up. I'm experiencing some issues running the test, but since it's ignored I don't consider it a hard blocker.
| } | ||
|
|
||
| #[test_log::test(tokio::test(flavor = "multi_thread", worker_threads = 4))] | ||
| #[ignore = "timing-sensitive reproduction for #1175; run manually with --run-ignored"] |
Oh, I didn't know until now that the MetaNameValueStr syntax is supported for ignore attributes: https://doc.rust-lang.org/reference/attributes/testing.html#r-attributes.testing.ignore.reason
Not sure what the benefit over a normal comment is 🤔
(I also didn't know this was called MetaNameValueStr until now when I looked it up 😅 )
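On the question of what the reason string buys over a plain comment: as far as I know, the standard test harness shows the reason next to the ignored test in its output, so whoever runs the suite sees why the test was skipped without opening the source. A minimal sketch follows; `median_ms` and the test name are hypothetical stand-ins, not code from this PR.

```rust
/// Hypothetical helper: median of a small set of latency samples, in milliseconds.
fn median_ms(samples: &mut [u64]) -> u64 {
    samples.sort_unstable();
    samples[samples.len() / 2]
}

// The reason string is attached to the test itself, so the test runner can
// report it alongside the "ignored" status. A comment would stay in the source.
#[test]
#[ignore = "timing-sensitive reproduction; run manually"]
fn latency_should_stay_under_budget() {
    let mut samples = [12, 9, 15];
    assert!(median_ms(&mut samples) < 50);
}
```

The test is skipped by default and only runs when ignored tests are explicitly requested.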
| 2. **CPU-bound, non-yielding poke loop.** `run_protocol` in `protocol.rs` | ||
| runs `protocol.poke()` until `Action::Wait`. A 64-batch triple gen burst | ||
| is tens-to-hundreds of ms between awaits. |
Oh this sounds exactly like the kind of work tokio::task::spawn_blocking is for. Alternatively we should consider inserting more await points.
> Oh this sounds exactly like the kind of work tokio::task::spawn_blocking is for.
Yes, that should at least fix the contention problem on the tokio runtime.
Are we considering adding waiting points inside the crypto implementation?
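To make the shape of that idea concrete, here is a std-only sketch of the pattern: the CPU-bound burst runs on its own OS thread and the caller gets a channel to collect the result, so the calling thread is never stuck inside the burst. In the node this role would be played by `tokio::task::spawn_blocking` and its blocking pool; the plain thread is a swapped-in analogue, and `offload` and `cpu_heavy_sum_of_squares` are hypothetical names, not functions from the codebase.

```rust
use std::sync::mpsc;
use std::thread;

/// Stand-in for a CPU-bound burst such as a batch of protocol pokes
/// (hypothetical workload, not the real triple generation).
fn cpu_heavy_sum_of_squares(n: u64) -> u64 {
    (0..n).map(|i| i * i).sum()
}

/// Run `work` on a dedicated OS thread and return a receiver for its result.
/// The caller stays free to do latency-sensitive work in the meantime.
fn offload<T, F>(work: F) -> mpsc::Receiver<T>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may have been dropped already; that is not an error here.
        let _ = tx.send(work());
    });
    rx
}
```

A caller would do something like `let rx = offload(|| cpu_heavy_sum_of_squares(1_000));`, keep servicing other work, and pick up the value later with `rx.recv()` or `rx.try_recv()`. This only helps if the burst really is CPU-bound, which is the open question raised further down in this thread.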
| 3. **Unbounded follower fan-out.** | ||
| `mpc_client.rs::monitor_passive_channels_inner` spawns one task per | ||
| incoming peer channel with no cap, so a node has no way to bound how much | ||
| follower work peers can push onto it. |
This is per network peer? Since we typically have ~10 participants this doesn't sound like it should be any problem.
I think it is, because each peer launches 2 batch triple computations (64 triples each batch) at the same time. So when we compute the load after resharing, it scales linearly with the number of participants.
Basically, with 10 nodes, right after resharing, each node will start concurrently computing 64 × 10 × 2 = 1280 triples and 16 (presignature concurrency) × 10 = 160 presignatures (as soon as triples become available). That seems like a lot, but the important bit is that, because of work started by peers, the load on each node scales linearly with the number of participants.
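The arithmetic behind that estimate can be written out as a small calculation. This is only a sketch of the back-of-the-envelope numbers from this thread (2 batches of 64 triples per peer, presignature concurrency 16, 10 participants); the struct and field names are hypothetical and do not correspond to configuration types in the node.

```rust
/// Hypothetical description of the refill burst right after a resharing.
#[derive(Debug, Clone, Copy)]
struct RefillBurst {
    participants: usize,
    triple_batches_per_peer: usize,
    triples_per_batch: usize,
    presignature_concurrency: usize,
}

impl RefillBurst {
    /// Triples one node works on at once, counting the follower work
    /// started by every participant.
    fn concurrent_triples(&self) -> usize {
        self.participants * self.triple_batches_per_peer * self.triples_per_batch
    }

    /// Presignatures one node works on at once, once triples are available.
    fn concurrent_presignatures(&self) -> usize {
        self.participants * self.presignature_concurrency
    }
}
```

Both quantities are linear in `participants`, which is the point of the comment above: a node cannot lower its own load by tuning only its local settings, because most of the work is started by its peers.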
| 5. **Mainnet runs two CaitSith domains.** Presignature generation runs | ||
| per-domain (one background loop per `(provider, domain)` in | ||
| `spawn_background_tasks`), so the per-node presig pipeline doubles. | ||
|
|
Sure, but this problem was observed before we ran two CaitSith domains.
Right, that may just have made it a bit worse. But since the domains share triples, and we also decreased the number of triples from 1M to 16K, we probably never really observed the difference.
| ### E — `spawn_blocking` or dedicated rayon pool | ||
|
|
||
| Move CPU-bound compute off async worker threads entirely. Overlaps with what | ||
| A achieves through runtime separation, but at a deeper restructuring layer. |
Yes, this sounds like something we should have added already. Should be a no-brainer for any CPU kind of work.
The one thing I am not 100% sure of is whether triple generation is purely CPU-bound, or whether the network also has a big influence.
| ### C — `yield_now()` in the poke loop | ||
|
|
||
| A single `tokio::task::yield_now().await` in `run_protocol`'s outer loop | ||
| shortens the maximum CPU burst between cooperative yield points | ||
| (a comment in `protocol.rs` already documents the hazard). | ||
|
|
||
| With A in place this mainly helps fairness *within* `gen_runtime`; gen tasks | ||
| yielding to each other does not free a core for signing. Cheap, harmless, | ||
| defensible independently of #1175. | ||
|
|
Yeah definitely worth exploring, unless we go with E at which point this shouldn't matter as much.
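A toy model may help show what option C changes and what it does not. The sketch below simulates a single worker with a FIFO run queue: some generation tasks are queued ahead of one signing task, and we count how many poke units pass before signing first gets to run. It is a deliberate simplification (one worker, fixed-cost pokes, hypothetical names) and not a model of the real tokio scheduler or of `run_protocol`.

```rust
use std::collections::VecDeque;

/// Work items in the toy run queue.
enum Task {
    /// A generation task with this many pokes still to do.
    Gen { remaining: u32 },
    /// The latency-sensitive signing task.
    Sign,
}

/// Poke units executed before the signing task is first scheduled.
/// `yield_every = None` means a gen task runs all its pokes in one burst;
/// `Some(k)` means it yields back to the queue after at most `k` pokes.
fn ticks_until_signing_runs(gen_tasks: usize, pokes_per_task: u32, yield_every: Option<u32>) -> u32 {
    let mut queue: VecDeque<Task> = (0..gen_tasks)
        .map(|_| Task::Gen { remaining: pokes_per_task })
        .collect();
    queue.push_back(Task::Sign);

    let mut clock = 0;
    while let Some(task) = queue.pop_front() {
        match task {
            Task::Sign => return clock,
            Task::Gen { remaining } => {
                // Length of this uninterrupted burst.
                let burst = match yield_every {
                    Some(k) => remaining.min(k.max(1)),
                    None => remaining,
                };
                clock += burst;
                if remaining > burst {
                    // Cooperative yield: go to the back of the queue.
                    queue.push_back(Task::Gen { remaining: remaining - burst });
                }
            }
        }
    }
    clock
}
```

With 4 gen tasks of 100 pokes each, signing waits 400 units without yielding and 40 units when each task yields every 10 pokes. The total amount of gen work is unchanged, which matches the caveat in the design doc: yielding improves fairness on a worker but does not free a core.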
|
|
||
| ## Solution options | ||
|
|
||
| ### A — Lower-OS-priority gen runtime |
I would much prefer not to add additional runtimes, but if E, and possibly C, don't help with this, we should definitely consider something like it.
| #[test_log::test(tokio::test(flavor = "multi_thread", worker_threads = 4))] | ||
| #[ignore = "timing-sensitive reproduction for #1175; run manually with --run-ignored"] | ||
| #[expect(non_snake_case)] | ||
| async fn signing_latency__should_degrade_under_concurrent_asset_generation() { |
Hmm I tried to run this on my machine now and I'm getting this error:
thread 'tests::asset_generation_signing_contention::signing_latency__should_degrade_under_concurrent_asset_generation' (419399) panicked at crates/node/src/tests/asset_generation_signing_contention.rs:292:5:
assertion `left == right` failed: steady-state baseline should not time out; harness is unhealthy
Anything you recognize?
We discussed this in the thread; it is probably due to the use of test-release mode.
Closes #1175
Note that here we don't consider the indexer as part of the problem. We can fairly safely assume it isn't, since we only face this right after resharings.