feat: close group stability for merkle payment verification #45
Conversation
The DHT routing tables are severely underpopulated — nodes discover only
~13% of the network via DHT despite being connected to ~90% via transport.
This causes close group disagreements at scale which would break merkle
payment verification (client quoting and node verification return different
closest-node sets).
Changes:
1. DHT routing table refresh background task (src/node.rs, tests/e2e/testnet.rs)
- Periodic random lookups every 30s (production) / 10s (tests) to keep
routing tables populated
- Improved routing table discovery from 13.3% to 16.7% at 100 nodes
- At 25 nodes: quoting-vs-verification Jaccard improved from 0.983 to 1.000
2. Close group confirmation protocol (src/close_group.rs)
- confirm_close_group(): multi-node consensus on close group membership
- is_node_in_close_group(): self-check for close group responsibility
- Sorted by XOR distance, threshold-based filtering
3. Close group validation in PUT handler (src/storage/handler.rs)
- AntProtocol checks if node is in close group before accepting PUT
- Uses OnceLock<Arc<P2PNode>> for deferred initialization
- Prevents storing data this node is not responsible for
4. Comprehensive test suite (tests/e2e/close_group_stability.rs)
- 8 tests: routing table completeness, ground truth overlap,
quoting-vs-verification agreement, temporal stability, result sizing
- Tests at 5, 25, and 100 node scales
- Detailed metrics output for diagnosis
5. Findings document (docs/close_group_stability_findings.md)
- Before/after comparison of all metrics
- Root cause analysis confirming DHT population is the primary issue
- Geo-location confirmed NOT the culprit in test environment
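The confirmation protocol described in item 2 can be sketched as follows. This is a simplified stand-in, not the PR's implementation: peer ids are shortened to 4 bytes and `xor_distance` is reduced to a `u32`, but the shape (count appearances across lookups, filter by threshold, sort by XOR distance) matches the description above.

```rust
use std::collections::HashMap;

// Simplified XOR distance over 4-byte ids (the real code uses full XorNames).
fn xor_distance(a: &[u8; 4], b: &[u8; 4]) -> u32 {
    u32::from_be_bytes([a[0] ^ b[0], a[1] ^ b[1], a[2] ^ b[2], a[3] ^ b[3]])
}

/// Aggregate several independent lookup results: peers appearing in at least
/// `threshold` result sets form the confirmed close group, sorted by XOR
/// distance to `target` and trimmed to `k` members.
fn confirm_close_group(
    lookups: &[Vec<[u8; 4]>],
    target: &[u8; 4],
    threshold: usize,
    k: usize,
) -> Vec<[u8; 4]> {
    let mut counts: HashMap<[u8; 4], usize> = HashMap::new();
    for result in lookups {
        for peer in result {
            *counts.entry(*peer).or_insert(0) += 1;
        }
    }
    let mut confirmed: Vec<[u8; 4]> = counts
        .into_iter()
        .filter(|&(_, n)| n >= threshold)
        .map(|(peer, _)| peer)
        .collect();
    confirmed.sort_by_key(|p| xor_distance(p, target));
    confirmed.truncate(k);
    confirmed
}
```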
@claude @greptile review please
Add allow directives for clippy lints that are acceptable in test code (cast_precision_loss, too_many_lines, doc_markdown, uninlined_format_args, items_after_statements). Remove unused per_target_results and response_rate variables.
Pull request overview
This PR improves close-group stability to support merkle payment verification by proactively refreshing DHT routing tables, adding utilities to confirm close-group membership, enforcing (optional) close-group validation on PUT, and introducing new large-scale E2E diagnostics/tests.
Changes:
- Add periodic DHT routing-table refresh tasks in production nodes and the E2E testnet harness.
- Add `close_group` utilities (`confirm_close_group`, `is_node_in_close_group`) and wire P2P node access into `AntProtocol` for PUT-time validation.
- Add new E2E test modules for close-group stability and 100-node network convergence, plus a findings document.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| `src/node.rs` | Wraps P2P node in `Arc`, wires it into `AntProtocol`, and starts/stops a periodic DHT refresh task. |
| `src/storage/handler.rs` | Adds optional P2P node plumbing via `OnceLock` and enforces close-group membership before payment verification on PUT. |
| `src/close_group.rs` | New close-group confirmation / membership utilities using DHT lookups. |
| `src/lib.rs` | Exposes the new `close_group` module. |
| `tests/e2e/testnet.rs` | Wires P2P node into `AntProtocol` in the harness and adds a per-node DHT refresh background task. |
| `tests/e2e/close_group_stability.rs` | New E2E suite measuring routing completeness, ground-truth overlap, quoting-vs-verification agreement, and temporal stability at 5/25/100 nodes. |
| `tests/e2e/network_convergence.rs` | New 100-node DHT "lookup path convergence" test and warmup routines. |
| `tests/e2e/mod.rs` | Registers the new E2E test modules. |
| `docs/close_group_stability_findings.md` | Documents before/after metrics, root causes, and implemented mitigations. |
src/close_group.rs
Outdated
```rust
        // Check if we are closer than or equal to the furthest peer in the group
        if peers.len() < CLOSE_GROUP_SIZE {
            // Not enough peers found — we are likely in the close group
            return true;
        }

        // Check if we're closer than the furthest member
        let furthest_distance = peers
            .iter()
            .filter_map(|p| peer_id_to_xor_name(&p.peer_id.to_hex()))
            .map(|xor| xor_distance(target, &xor))
            .max();

        furthest_distance.map_or(true, |furthest| my_distance <= furthest)
    }
    _ => {
        // Lookup failed — can't determine, assume we're in to avoid
        // rejecting valid payments
        true
```
Returning true when peers.len() < CLOSE_GROUP_SIZE is very permissive and can cause handle_put to accept storage even when the node couldn’t retrieve a full close group (which may indicate incomplete routing rather than responsibility). Consider treating undersized results as “cannot confirm” and returning false/Err so the PUT can be rejected or retried.
```diff
-        // Check if we are closer than or equal to the furthest peer in the group
-        if peers.len() < CLOSE_GROUP_SIZE {
-            // Not enough peers found — we are likely in the close group
-            return true;
-        }
-        // Check if we're closer than the furthest member
-        let furthest_distance = peers
-            .iter()
-            .filter_map(|p| peer_id_to_xor_name(&p.peer_id.to_hex()))
-            .map(|xor| xor_distance(target, &xor))
-            .max();
-        furthest_distance.map_or(true, |furthest| my_distance <= furthest)
-    }
-    _ => {
-        // Lookup failed — can't determine, assume we're in to avoid
-        // rejecting valid payments
-        true
+        // We can only confirm responsibility if we obtained a full close group.
+        // Undersized results may indicate incomplete routing, so treat as
+        // "cannot confirm" rather than assuming responsibility.
+        if peers.len() < CLOSE_GROUP_SIZE {
+            return false;
+        }
+        // Check if we're closer than or equal to the furthest member
+        let furthest_distance = peers
+            .iter()
+            .filter_map(|p| peer_id_to_xor_name(&p.peer_id.to_hex()))
+            .map(|xor| xor_distance(target, &xor))
+            .max();
+        // If we cannot compute any distances, we cannot confirm responsibility.
+        furthest_distance.map_or(false, |furthest| my_distance <= furthest)
+    }
+    _ => {
+        // Lookup failed — can't determine, so do NOT confirm responsibility.
+        false
```
src/close_group.rs
Outdated
```rust
    _ => {
        // Lookup failed — can't determine, assume we're in to avoid
        // rejecting valid payments
        true
    }
```
On lookup failure/timeout this function returns true, which means close-group validation in PUT becomes a no-op whenever the DHT is unhealthy (exactly the scenario this is meant to guard). Consider returning false (or a Result so callers can return a retryable error) and logging at warn level to aid diagnosis.
```diff
-    _ => {
-        // Lookup failed — can't determine, assume we're in to avoid
-        // rejecting valid payments
-        true
-    }
+    Ok(Err(err)) => {
+        // Lookup failed — treat as not in close group and log for diagnosis.
+        warn!(
+            ?err,
+            "DHT close group lookup failed while checking node responsibility; \
+             treating node as not in close group"
+        );
+        false
+    }
+    Err(elapsed) => {
+        // Lookup timed out — treat as not in close group and log for diagnosis.
+        warn!(
+            ?elapsed,
+            "DHT close group lookup timed out after {:?} while checking node responsibility; \
+             treating node as not in close group",
+            CONFIRMATION_LOOKUP_TIMEOUT
+        );
+        false
+    }
```
```rust
for &obs_idx in observers {
    if let Some(p2p) = harness.node(obs_idx) {
        total_queried += 1;
        let timeout_dur = Duration::from_secs(30);
        let query = p2p.dht().find_closest_nodes(target, K);
        match tokio::time::timeout(timeout_dur, query).await {
            Ok(Ok(peers)) if !peers.is_empty() => {
```
These per-observer DHT lookups are executed sequentially, each with a 30s timeout. If the DHT is unhealthy, the test can spend a long time waiting on timeouts back-to-back and stall CI. Consider running observer lookups in parallel (with a concurrency cap) and/or adding an overall test timeout to bound failure-mode runtime.
```rust
if let Some(p2p) = harness.node(observer_idx) {
    match tokio::time::timeout(DHT_LOOKUP_TIMEOUT, p2p.dht().find_closest_nodes(target, k))
        .await
    {
        Ok(Ok(peers)) => peers.iter().map(|p| p.peer_id.to_hex()).collect(),
        _ => vec![],
```
dht_lookup uses a 30s timeout and is called in large nested loops (including 100-node tests). In failure modes this can make the suite take an extremely long time due to sequential timeouts. Consider reducing the timeout for these diagnostics, parallelizing lookups with a concurrency cap, and/or adding an overall per-test timeout.
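The parallelism with a concurrency cap suggested in these two comments can be sketched as below. This is a hedged illustration using std threads and fixed-size chunks rather than the project's tokio tasks; `lookup` stands in for a single timed DHT query, and `parallel_lookups` is a hypothetical helper name. Worst-case wall time becomes roughly `ceil(n / max_parallel)` timeouts instead of `n` sequential ones.

```rust
use std::thread;

// Hypothetical sketch: run `lookup` for each observer with at most
// `max_parallel` workers in flight at once. Each chunk is spawned and
// joined before the next chunk starts, bounding concurrency.
fn parallel_lookups<T, F>(observers: &[usize], max_parallel: usize, lookup: F) -> Vec<T>
where
    T: Send + 'static,
    F: Fn(usize) -> T + Copy + Send + 'static,
{
    let mut results = Vec::with_capacity(observers.len());
    for chunk in observers.chunks(max_parallel.max(1)) {
        // One worker per observer in this chunk; each gets its own copy of `lookup`.
        let handles: Vec<_> = chunk
            .iter()
            .map(|&idx| thread::spawn(move || lookup(idx)))
            .collect();
        for handle in handles {
            if let Ok(result) = handle.join() {
                results.push(result);
            }
        }
    }
    results
}
```

In the real async test harness the same shape would use `futures::stream::iter(...).buffer_unordered(cap)` or a semaphore around `tokio::time::timeout`.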
src/storage/handler.rs
Outdated
```rust
if !is_node_in_close_group(p2p, &address).await {
    debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
    return ChunkPutResponse::Error(ProtocolError::Internal(
        "This node is not in the close group for this address".to_string(),
    ));
```
Close-group rejection is returned as ProtocolError::Internal, but this is an expected routing/ownership condition rather than an internal failure. Consider adding a dedicated protocol error/response for “not responsible / not in close group” so clients can distinguish it from server errors and retry against other close-group members.
```rust
pub fn with_p2p_node(self, p2p_node: Arc<P2PNode>) -> Self {
    // OnceLock is empty after construction, so set() always succeeds here.
    let _ = self.p2p_node.set(p2p_node);
    self
```
OnceLock::set returns Err if the value was already set, but the result is ignored here. If with_p2p_node is ever called on an AntProtocol that already had p2p_node configured, this would silently do nothing and make debugging wiring issues hard. Consider handling the Err (e.g., debug_assert!, returning self unchanged with a warn!, or changing the API to make double-setting impossible).
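A minimal sketch of the suggested handling, with `String` standing in for `Arc<P2PNode>` and `eprintln!` for the project's `warn!` macro; `OnceLock::set` really does return `Err(value)` when already initialized, and the first value is kept:

```rust
use std::sync::OnceLock;

struct Protocol {
    p2p_node: OnceLock<String>, // String stands in for Arc<P2PNode>
}

impl Protocol {
    fn with_p2p_node(self, node: String) -> Self {
        // Surface double-initialization instead of silently ignoring it.
        if self.p2p_node.set(node).is_err() {
            eprintln!("with_p2p_node called twice; keeping the first value");
        }
        self
    }
}
```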
```rust
let step = if nodes.len() > actual_lookups {
    nodes.len() / actual_lookups
} else {
    1
};
```
This step computation can panic when actual_lookups == 0 (e.g., num_lookups == 0 or nodes is empty) due to division by zero. Add an early return when actual_lookups == 0 (and/or nodes.is_empty()) that returns an empty ConfirmedCloseGroup.
src/close_group.rs
Outdated
```rust
// Trim to close group size
confirmed.truncate(CLOSE_GROUP_SIZE);
```
confirm_close_group takes a k parameter, but the returned members is always truncated to CLOSE_GROUP_SIZE, so callers can’t get more than 5 confirmed peers even if they request a larger set. Either truncate to k (or k.min(CLOSE_GROUP_SIZE) if that’s the intent), or remove k from the public API and document the fixed-size behavior.
```diff
-// Trim to close group size
-confirmed.truncate(CLOSE_GROUP_SIZE);
+// Trim to the requested size, but never exceed the close group size
+confirmed.truncate(k.min(CLOSE_GROUP_SIZE));
```
- Add NotInCloseGroup protocol error variant instead of using Internal
- Add early return for empty nodes/lookups in confirm_close_group
- Fix truncation to use k parameter instead of hardcoded CLOSE_GROUP_SIZE
- Make is_node_in_close_group return false on undersized/failed lookups
- Log warning on double-set of P2P node via OnceLock
- Use is_some_and instead of map_or(false, ...) per clippy
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
```rust
    min_discovered >= total_nodes / 4,
    "Node with fewest DHT discoveries found only {min_discovered}/{total_nodes} — \
     routing tables are severely incomplete"
);

// Every node should have at least 3 direct connections
assert!(
    min_peers >= 2,
```
The assertion threshold doesn’t match the preceding comment: the comment says each node should discover at least 50% of peers via DHT, but the code asserts only total_nodes / 4 (25%). Please align the comment and the threshold so the test’s intent is unambiguous.
```diff
-    min_discovered >= total_nodes / 4,
-    "Node with fewest DHT discoveries found only {min_discovered}/{total_nodes} — \
-     routing tables are severely incomplete"
-);
-// Every node should have at least 3 direct connections
-assert!(
-    min_peers >= 2,
+    min_discovered >= total_nodes / 2,
+    "Node with fewest DHT discoveries found only {min_discovered}/{total_nodes} — \
+     routing tables are severely incomplete"
+);
+// Every node should have at least 3 direct connections
+assert!(
+    min_peers >= 3,
```
tests/e2e/close_group_stability.rs
Outdated
```rust
     routing tables are severely incomplete"
);

// Every node should have at least 3 direct connections
```
The comment says "Every node should have at least 3 direct connections", but the assertion allows min_peers >= 2. Please make the comment and the assertion consistent (either tighten the assert to 3 or update the comment).
```diff
-// Every node should have at least 3 direct connections
+// Every node should have at least 2 direct connections
```
src/close_group.rs
Outdated
```rust
//! Multiple nodes independently look up the same address and the intersection
//! of their results forms the "confirmed" close group.
```
The module docs say the confirmed close group is formed by the intersection of lookup results, but confirm_close_group() actually uses an appearance-count threshold (quorum/majority), not strict intersection. This is misleading for callers and future maintainers—please update the docs to describe the threshold-based confirmation correctly.
```diff
-//! Multiple nodes independently look up the same address and the intersection
-//! of their results forms the "confirmed" close group.
+//! Multiple nodes independently look up the same address and aggregate the
+//! returned close groups by counting how often each peer appears. Peers that
+//! meet or exceed a configured appearance threshold (e.g. quorum/majority)
+//! form the "confirmed" close group.
```
src/storage/handler.rs
Outdated
```rust
// 4. Close group verification — check if this node is responsible
// for the address. This prevents storing data we're not close to,
// which is critical for merkle payment verification consistency.
if let Some(p2p) = self.p2p_node.get() {
    if !is_node_in_close_group(p2p, &address).await {
        debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
        return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
    }
}
```
Close-group verification runs before payment verification. This makes unauthenticated/unpaid PUT attempts trigger an extra DHT lookup (amplifying DoS potential) and also changes the error precedence clients see (e.g., NotInCloseGroup instead of PaymentRequired/PaymentFailed). Consider moving the close-group check after you’ve determined the request is payable/paid, and ideally only enforcing it for Merkle proofs (SingleNode proofs likely shouldn’t require close-group membership).
- Replace direct indexing nodes[node_idx] with nodes.get() in confirm_close_group (project rule: no direct indexing)
- Reorder PUT handler: payment verification (O(1) cache check) now runs before close group check (DHT lookup with 15s timeout) to prevent unpaid requests from triggering expensive DHT lookups (DoS vector)
- Comment said 50% but assertion was 25% (total_nodes / 4) -- fixed comment
- Comment said 3 connections but assertion was >= 2 -- fixed comment
- Module doc said "intersection" but code uses quorum/threshold -- fixed doc
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
```markdown
### Fix 3: Close group validation in PUT handler (`src/storage/handler.rs`)

`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, before payment verification, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification.
```
The findings doc says the PUT handler performs the close-group check “before payment verification”, but src/storage/handler.rs now does the close-group lookup after verifying payment (to avoid unpaid DHT-lookup DoS). Please update this section to match the implemented order so the document doesn’t mislead future readers.
```diff
-`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, before payment verification, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification.
+`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, after successfully verifying payment, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification while avoiding unpaid DHT-lookup DoS vectors.
```
```rust
    PaymentFailed(String),
    /// Quote generation failed.
    QuoteFailed(String),
    /// This node is not in the close group for the requested address.
    /// The client should retry with a different close-group member.
    NotInCloseGroup,
    /// Internal error.
    Internal(String),
```
Adding ProtocolError::NotInCloseGroup changes the on-wire postcard/serde enum layout for ProtocolError. Older nodes/clients using the existing autonomi/ant/chunk/v1 protocol will fail to deserialize (or misinterpret) responses containing this new variant, and even the Internal variant’s index shifts. Please introduce explicit protocol versioning for this wire type (e.g., bump CHUNK_PROTOCOL_ID/PROTOCOL_VERSION and route accordingly, or redesign error encoding to be forward/backward compatible) before shipping this new error across the network.
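The hazard can be illustrated without the real codec. Postcard-style serde enum encoding writes a variant's ordinal index on the wire, so the sketch below (hand-rolled indices, not postcard itself) shows how inserting `NotInCloseGroup` shifts `Internal` and breaks an old decoder:

```rust
// v1 wire enum: variants encode as their ordinal position (0, 1, 2).
#[derive(Debug, PartialEq)]
enum ErrorV1 { PaymentFailed, QuoteFailed, Internal }

// v2 inserts NotInCloseGroup before Internal, shifting Internal's index.
#[derive(Debug, PartialEq)]
enum ErrorV2 { PaymentFailed, QuoteFailed, NotInCloseGroup, Internal }

fn encode_v2(e: &ErrorV2) -> u8 {
    match e {
        ErrorV2::PaymentFailed => 0,
        ErrorV2::QuoteFailed => 1,
        ErrorV2::NotInCloseGroup => 2,
        ErrorV2::Internal => 3,
    }
}

fn decode_v1(idx: u8) -> Option<ErrorV1> {
    match idx {
        0 => Some(ErrorV1::PaymentFailed),
        1 => Some(ErrorV1::QuoteFailed),
        2 => Some(ErrorV1::Internal),
        _ => None, // v1 has no variant at index 3: decode failure
    }
}
```

A v1 peer decoding a v2 `Internal` fails outright, and decodes a v2 `NotInCloseGroup` as `Internal`, which is exactly the cross-version breakage the comment warns about.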
```rust
if let Some(p2p) = self.p2p_node.get() {
    if !is_node_in_close_group(p2p, &address).await {
        debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
        return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
```
The new close-group enforcement will reject PUTs on nodes that aren’t in the close group for address. This is a behavioral breaking change for any caller that PUTs to an arbitrary node (including existing E2E helpers like TestNode::store_chunk/remote store tests that don’t first choose a close-group member). Either update the call paths/tests to first select the close group via DHT and target those nodes, or gate this check behind an explicit mode/config/payment-proof type so legacy PUT flows keep working.
```diff
-if let Some(p2p) = self.p2p_node.get() {
-    if !is_node_in_close_group(p2p, &address).await {
-        debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
-        return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
+//
+// To avoid breaking legacy PUT flows that send to arbitrary nodes,
+// we only enforce close-group checks when a payment proof is supplied.
+if request.payment_proof.is_some() {
+    if let Some(p2p) = self.p2p_node.get() {
+        if !is_node_in_close_group(p2p, &address).await {
+            debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
+            return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
+        }
```
```rust
    num_lookups: usize,
    threshold: usize,
) -> ConfirmedCloseGroup {
    let actual_lookups = num_lookups.min(nodes.len());

    if actual_lookups == 0 || nodes.is_empty() {
        return ConfirmedCloseGroup {
            members: Vec::new(),
            num_lookups: 0,
            num_responses: 0,
            threshold,
        };
```
confirm_close_group() accepts threshold without validation. If a caller passes threshold > actual_lookups, confirmation becomes impossible and the function will always return an empty members list with no clear signal that the parameters are invalid. Consider clamping threshold to actual_lookups (or returning an explicit empty result with a warning/error) so misconfiguration doesn’t silently disable confirmation.
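The suggested clamping can be sketched in one helper (the name `effective_threshold` is hypothetical): an unreachable threshold is pulled down to the number of lookups actually performed, and zero is bumped to one so confirmation never becomes a no-op.

```rust
/// Clamp a caller-supplied confirmation threshold into the achievable
/// range [1, actual_lookups], so misconfiguration can't silently make
/// confirmation impossible (threshold too high) or meaningless (zero).
fn effective_threshold(threshold: usize, actual_lookups: usize) -> usize {
    threshold.clamp(1, actual_lookups.max(1))
}
```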
During merkle payment verification, nodes now check that the candidate nodes in the winner pool are actually the closest nodes to the data address. This prevents malicious clients from building winner pools with arbitrary (non-closest) nodes. The verification performs a DHT lookup for the data address, derives peer IDs from each candidate ML-DSA public key, and checks that a majority appear in the close group. On DHT failure, the check is skipped with a warning for availability during network instability.
Addresses critical review findings:

1. Quote handlers (handle_quote, handle_merkle_candidate_quote) now verify the node is in the close group before issuing quotes. This prevents the quote-pay-refuse failure path where a client pays a quoted node that later rejects the PUT with NotInCloseGroup.
2. is_node_in_close_group now requests K+1 peers and filters out self before comparing distances. This fixes the off-by-one where the DHT may include the querying node in results, making the comparison against only K-1 external peers.
3. Test assertion comments aligned with actual thresholds (comments now describe the threshold as a catastrophic-failure detector, with the VERDICT log showing ideal-bar status).
Replace the single-lookup is_node_in_close_group with a multi-lookup threshold-based approach. The function now performs 3 independent DHT lookups (each following different iterative paths through the network) and requires 2/3 to agree that this node is in the close group. This directly addresses the core problem: with routing tables only covering 13-17% of the network, a single lookup can follow a path that misses closer nodes. Multiple lookups traverse different parts of the routing graph and are far more likely to discover the true closest peers. The same consensus mechanism is now used in the live PUT path, quote handlers, and merkle candidate verification — not just in tests.
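The 3-lookup / 2-agree mechanism can be sketched as follows; `lookup` stands in for one iterative DHT query that reports whether this node appeared in that query's close group, and the function name is a simplified stand-in for the PR's implementation:

```rust
/// Run `lookup` up to `num_lookups` times and confirm membership once
/// `required` lookups agree. Returns early as soon as consensus is reached,
/// mirroring the 3-lookup / 2-agree scheme described above.
fn is_in_close_group_consensus<F: FnMut(usize) -> bool>(
    num_lookups: usize,
    required: usize,
    mut lookup: F,
) -> bool {
    let mut positive = 0;
    for i in 0..num_lookups {
        if lookup(i) {
            positive += 1;
            if positive >= required {
                return true; // early exit once consensus is reached
            }
        }
    }
    false
}
```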
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
src/close_group.rs
Outdated
```rust
// If we couldn't retrieve enough external peers, we can't confirm
// responsibility — treat as "not in close group" so the PUT is
// rejected or retried rather than silently accepted.
if external_peers.len() < CLOSE_GROUP_SIZE {
    warn!(
        "is_node_in_close_group: only found {} external peers (need {CLOSE_GROUP_SIZE})",
        external_peers.len()
    );
    return false;
}
```
is_node_in_close_group returns false whenever fewer than CLOSE_GROUP_SIZE external peers are returned. In small networks (e.g. 5 nodes) find_closest_nodes can only return 4 external peers, so this will always mark the node as not-in-close-group and cause all Quote/PUT requests to be rejected with NotInCloseGroup once the P2P node is wired in. Consider handling small-network cases (e.g. if node.peer_count().await + 1 <= CLOSE_GROUP_SIZE, treat as in close group), or relaxing the minimum-external-peers requirement so close-group checks don’t brick early deployments.
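The suggested small-network escape hatch is a one-line predicate. In this sketch the names are hypothetical and `peer_count` stands in for the value of `node.peer_count().await`: when self plus all known peers fit inside one close group, every node is responsible for every address by definition.

```rust
const CLOSE_GROUP_SIZE: usize = 5;

/// In a network no larger than the close group, every node is in the
/// close group for every address, so membership checks should pass
/// without requiring CLOSE_GROUP_SIZE external lookup results.
fn small_network_always_in_group(peer_count: usize) -> bool {
    peer_count + 1 <= CLOSE_GROUP_SIZE
}
```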
```rust
        // Empty results — can't verify, log and allow
        warn!(
            "Cannot verify merkle candidates for {}: DHT returned empty results",
            hex::encode(xorname)
        );
        return Ok(());
    }
    Ok(Err(e)) => {
        warn!(
            "Cannot verify merkle candidates for {}: DHT error: {e}",
            hex::encode(xorname)
        );
        return Ok(());
    }
    Err(_) => {
        warn!(
            "Cannot verify merkle candidates for {}: DHT lookup timed out",
            hex::encode(xorname)
        );
        return Ok(());
```
verify_candidates_are_closest currently returns Ok(()) (allowing verification to pass) when the DHT lookup is empty/errors/times out. Since this check is explicitly intended to prevent malicious clients from submitting non-close-group candidate pools, failing open here allows an attacker to bypass the protection by inducing DHT failures/timeouts (e.g. via load) at the verifier. Consider failing closed (returning an error so the client retries), or gating this behavior behind a config flag with a clear default for security vs availability.
```diff
-        // Empty results — can't verify, log and allow
-        warn!(
-            "Cannot verify merkle candidates for {}: DHT returned empty results",
-            hex::encode(xorname)
-        );
-        return Ok(());
-    }
-    Ok(Err(e)) => {
-        warn!(
-            "Cannot verify merkle candidates for {}: DHT error: {e}",
-            hex::encode(xorname)
-        );
-        return Ok(());
-    }
-    Err(_) => {
-        warn!(
-            "Cannot verify merkle candidates for {}: DHT lookup timed out",
-            hex::encode(xorname)
-        );
-        return Ok(());
+        // Empty results — can't verify, fail closed so caller can retry
+        warn!(
+            "Cannot verify merkle candidates for {}: DHT returned empty results",
+            hex::encode(xorname)
+        );
+        return Err(Error::Payment(format!(
+            "Unable to verify merkle candidates for {}: DHT returned empty results",
+            hex::encode(xorname)
+        )));
+    }
+    Ok(Err(e)) => {
+        warn!(
+            "Cannot verify merkle candidates for {}: DHT error: {e}",
+            hex::encode(xorname)
+        );
+        return Err(Error::Payment(format!(
+            "Unable to verify merkle candidates for {}: DHT error: {e}",
+            hex::encode(xorname)
+        )));
+    }
+    Err(_) => {
+        warn!(
+            "Cannot verify merkle candidates for {}: DHT lookup timed out",
+            hex::encode(xorname)
+        );
+        return Err(Error::Payment(format!(
+            "Unable to verify merkle candidates for {}: DHT lookup timed out",
+            hex::encode(xorname)
+        )));
```
```markdown
### Fix 3: Close group validation in PUT handler (`src/storage/handler.rs`)

`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, before payment verification, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification.

Wired in production (`src/node.rs`) via `with_p2p_node()` and in tests (`tests/e2e/testnet.rs`) via `set_p2p_node()`.
```
The findings doc says the PUT handler calls is_node_in_close_group() before payment verification, but the implementation in src/storage/handler.rs performs the close-group check after payment verification (to avoid DHT-lookup DoS). Please update this text to match the actual behavior so the documentation doesn’t mislead readers.
Documents the root cause (routing table not seeded from transport connections) and proposes 4 fixes in saorsa-core:

1. Seed DHT routing table from transport connections (critical)
2. Periodic k-bucket refresh (important)
3. Expose routing table contents via API (nice-to-have)
4. Local-only find_closest_nodes (important for performance)

Includes measured data, expected impact, validation steps, and references to the original Kademlia paper.
The devnet path (src/devnet.rs) was missing the set_p2p_node call, meaning close group verification was silently skipped for all devnet nodes. Now all three node creation paths (production, devnet, test) wire the P2P node into the protocol handler and payment verifier.
Local-only find_closest_nodes would return inaccurate results with routing tables at 13-17% discovery -- the exact problem we are trying to fix. Only viable after routing table seeding is implemented.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
```rust
// Request K+1 peers because the DHT may include this node in results.
let lookup_k = CLOSE_GROUP_SIZE + 1;
```
is_node_in_close_group requests CLOSE_GROUP_SIZE + 1 peers and then computes the max distance across all external_peers. If the DHT result does not include this node, external_peers can contain K+1 peers, making the “furthest” distance correspond to the (K+1)th closest and causing false positives (a node just outside the close group may be accepted). Consider trimming to exactly CLOSE_GROUP_SIZE external peers (by XOR distance) before comparing, or otherwise ensure you’re comparing against the Kth-closest boundary excluding self.
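The suggested trim can be sketched with distances reduced to plain integers (the real code compares XorName distances). Keeping only the K closest external peers before taking the maximum makes the comparison use the true Kth-closest boundary:

```rust
/// After excluding self from lookup results, keep only the K closest
/// external peers and return the Kth-closest distance. With K+1 results
/// this drops the (K+1)th peer so the boundary is not inflated.
fn kth_closest_boundary(mut external_distances: Vec<u32>, k: usize) -> Option<u32> {
    external_distances.sort_unstable();
    external_distances.truncate(k); // discard anything past the Kth peer
    external_distances.last().copied()
}

/// A node is in the close group if its distance to the target is within
/// the Kth-closest boundary; no boundary means we cannot confirm.
fn in_close_group(my_distance: u32, external_distances: Vec<u32>, k: usize) -> bool {
    kth_closest_boundary(external_distances, k)
        .is_some_and(|furthest| my_distance <= furthest)
}
```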
```rust
// Require at least a majority of candidates to be in the close group.
// We use CLOSE_GROUP_SIZE as the bar since the winner pool may have
// more candidates than the close group (e.g., 16 candidates for K=5).
let min_required = if candidates.len() >= CLOSE_GROUP_SIZE {
    // At least CLOSE_GROUP_MAJORITY of the close group nodes
    // should appear among the candidates
    (CLOSE_GROUP_SIZE / 2) + 1
} else {
    // Small pool — require majority of what's there
    (candidates.len() / 2) + 1
};

if candidates_in_close_group < min_required {
    return Err(Error::Payment(format!(
        "Merkle candidate pool for {} has only {candidates_in_close_group} nodes \
         in the close group (need {min_required}). Candidates may not be the \
         actual closest nodes to this address.",
        hex::encode(xorname)
```
The doc/comment says “majority of the candidates” must be in the close group, but the computed min_required is based on CLOSE_GROUP_SIZE when candidates.len() >= CLOSE_GROUP_SIZE (e.g., only 3 required for a 16-candidate pool). Either tighten the threshold to match the stated intent (majority of candidates / all paid recipients), or update the docs/comments/error message so the security model is unambiguous.
```diff
-// Require at least a majority of candidates to be in the close group.
-// We use CLOSE_GROUP_SIZE as the bar since the winner pool may have
-// more candidates than the close group (e.g., 16 candidates for K=5).
-let min_required = if candidates.len() >= CLOSE_GROUP_SIZE {
-    // At least CLOSE_GROUP_MAJORITY of the close group nodes
-    // should appear among the candidates
-    (CLOSE_GROUP_SIZE / 2) + 1
-} else {
-    // Small pool — require majority of what's there
-    (candidates.len() / 2) + 1
-};
-if candidates_in_close_group < min_required {
-    return Err(Error::Payment(format!(
-        "Merkle candidate pool for {} has only {candidates_in_close_group} nodes \
-         in the close group (need {min_required}). Candidates may not be the \
-         actual closest nodes to this address.",
-        hex::encode(xorname)
+// Require that a majority of the relevant nodes are in the close group.
+// When there are at least CLOSE_GROUP_SIZE candidates, we require a
+// majority of the close group itself to appear among the candidates.
+// When there are fewer candidates than CLOSE_GROUP_SIZE, we require a
+// majority of the available candidates.
+let min_required = if candidates.len() >= CLOSE_GROUP_SIZE {
+    // At least a simple majority of the close group nodes should appear
+    // among the candidates.
+    (CLOSE_GROUP_SIZE / 2) + 1
+} else {
+    // Small pool — require majority of what's there.
+    (candidates.len() / 2) + 1
+};
+if candidates_in_close_group < min_required {
+    return Err(Error::Payment(format!(
+        "Merkle candidate pool for {} has only {candidates_in_close_group} nodes \
+         in the close group (need at least {min_required} based on close group \
+         size {close_group_size} and {candidates_len} candidates). Candidates \
+         may not be the actual closest nodes to this address.",
+        hex::encode(xorname),
+        close_group_size = CLOSE_GROUP_SIZE,
+        candidates_len = candidates.len(),
```
```rust
// Verify this node is in the close group for the address before
// issuing a quote. This prevents clients from paying nodes that
// will later refuse the PUT with NotInCloseGroup.
if let Some(p2p) = self.p2p_node.get() {
    if !is_node_in_close_group(p2p, &request.address).await {
```
handle_quote now performs close-group verification via is_node_in_close_group, which triggers multiple DHT lookups per unauthenticated/unpaid quote request (and can block up to the lookup timeout). This increases quote latency and creates an easy DHT-amplification/DoS vector. Consider adding caching (per address and/or per peer), rate limiting, a cheaper/local check for the quote path, and/or shorter/parallelized lookups specifically for quote handling.
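One possible shape for the caching mitigation, sketched with std-only types. The struct name, TTL, and 32-byte address key are illustrative choices, not the crate's API:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Memoize the close-group membership verdict per address for a short TTL,
/// so repeated quote requests for the same address do not each trigger a
/// fresh round of DHT lookups.
struct CloseGroupCache {
    ttl: Duration,
    entries: HashMap<[u8; 32], (bool, Instant)>,
}

impl CloseGroupCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return the cached verdict if it is still fresh; otherwise run the
    /// (expensive) computation and cache its result.
    fn get_or_insert_with(&mut self, addr: [u8; 32], compute: impl FnOnce() -> bool) -> bool {
        let now = Instant::now();
        if let Some(&(verdict, at)) = self.entries.get(&addr) {
            if now.duration_since(at) < self.ttl {
                return verdict; // fresh entry: skip the DHT lookups entirely
            }
        }
        let verdict = compute(); // e.g. the expensive is_node_in_close_group call
        self.entries.insert(addr, (verdict, now));
        verdict
    }
}
```

A stale-but-bounded answer is acceptable here because close-group membership changes only as fast as churn; a TTL on the order of seconds already collapses a burst of quote requests into a single lookup round.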
```rust
// Verify this node is in the close group for the address before
// issuing a merkle candidate quote.
if let Some(p2p) = self.p2p_node.get() {
    if !is_node_in_close_group(p2p, &request.address).await {
        debug!("Refusing merkle candidate quote for {addr_hex}: not in close group");
```
handle_merkle_candidate_quote also runs is_node_in_close_group (multiple DHT lookups) before issuing a candidate quote. Since these requests are likewise unauthenticated/unpaid, this can be abused to generate sustained DHT traffic and add significant latency to the quoting flow. Consider applying the same mitigations as for handle_quote (caching/rate limiting/cheaper check/shorter timeouts).
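The rate-limiting half of the mitigation could be as simple as a fixed-window counter per requesting peer; names and limits below are illustrative, not part of the crate:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Cap how many quote-path close-group checks a single peer can trigger
/// per time window, refusing the expensive DHT work once the budget is
/// exhausted. The peer is keyed by a string id here for simplicity.
struct QuoteRateLimiter {
    max_per_window: u32,
    window: Duration,
    counters: HashMap<String, (u32, Instant)>,
}

impl QuoteRateLimiter {
    fn new(max_per_window: u32, window: Duration) -> Self {
        Self { max_per_window, window, counters: HashMap::new() }
    }

    /// Returns true if the peer is still within its budget for this window.
    fn allow(&mut self, peer_id: &str) -> bool {
        let now = Instant::now();
        let entry = self.counters.entry(peer_id.to_string()).or_insert((0, now));
        if now.duration_since(entry.1) >= self.window {
            *entry = (0, now); // window expired: reset the counter
        }
        if entry.0 < self.max_per_window {
            entry.0 += 1;
            true
        } else {
            false // over budget: refuse the expensive close-group check
        }
    }
}
```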
Closing for now, @mickvandijke feel free to re-open
Summary
Key Findings
Geo-location confirmed NOT the culprit -- problem persists with all diversity checks disabled. The core bottleneck is in saorsa-core's DHT routing table population.
Changes
- DHT routing table refresh background task (src/node.rs, tests/e2e/testnet.rs) -- periodic random lookups every 30s (prod) / 10s (tests)
- Close group confirmation protocol (src/close_group.rs) -- confirm_close_group() and is_node_in_close_group() utilities
- Close group validation in PUT handler (src/storage/handler.rs) -- AntProtocol checks close group membership before accepting storage
- Comprehensive test suite (tests/e2e/close_group_stability.rs) -- 8 tests at 5/25/100 node scales
- Findings document (docs/close_group_stability_findings.md) -- full before/after analysis

Test plan
- cargo build passes
- cargo clippy --all-features passes (no errors)
- cargo test --test e2e --features test-utils close_group_stability -- --nocapture
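The refresh background task described in the changes above (periodic random lookups to keep routing tables populated) can be sketched roughly as follows; `lookup` stands in for the real DHT query, the interval mirrors the PR's 30s production / 10s test settings, and everything else here is illustrative:

```rust
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Generate a pseudo-random 32-byte lookup target. The real code would
/// use a proper CSPRNG; a clock-seeded LCG is enough for a sketch.
fn random_target() -> [u8; 32] {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos();
    let mut state = nanos.wrapping_mul(2_654_435_761);
    let mut target = [0u8; 32];
    for b in &mut target {
        state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223); // LCG step
        *b = (state >> 24) as u8;
    }
    target
}

/// Spawn a background loop that periodically issues a lookup for a random
/// target, so Kademlia buckets across the keyspace keep getting refreshed.
fn spawn_refresh_task(
    interval: Duration,
    lookup: impl Fn([u8; 32]) + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        lookup(random_target()); // each lookup repopulates buckets along its path
        thread::sleep(interval);
    })
}
```

In the PR itself this runs inside the node's async runtime rather than a raw thread, but the shape is the same: a timer loop driving lookups toward uniformly spread random targets.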