feat: close group stability for merkle payment verification #45
Conversation
The DHT routing tables are severely underpopulated — nodes discover only
~13% of the network via DHT despite being connected to ~90% via transport.
This causes close group disagreements at scale which would break merkle
payment verification (client quoting and node verification return different
closest-node sets).
Changes:
1. DHT routing table refresh background task (src/node.rs, tests/e2e/testnet.rs)
- Periodic random lookups every 30s (production) / 10s (tests) to keep
routing tables populated
- Improved routing table discovery from 13.3% to 16.7% at 100 nodes
- At 25 nodes: quoting-vs-verification Jaccard improved from 0.983 to 1.000
2. Close group confirmation protocol (src/close_group.rs)
- confirm_close_group(): multi-node consensus on close group membership
- is_node_in_close_group(): self-check for close group responsibility
- Sorted by XOR distance, threshold-based filtering
3. Close group validation in PUT handler (src/storage/handler.rs)
- AntProtocol checks if node is in close group before accepting PUT
- Uses OnceLock<Arc<P2PNode>> for deferred initialization
- Prevents storing data this node is not responsible for
4. Comprehensive test suite (tests/e2e/close_group_stability.rs)
- 8 tests: routing table completeness, ground truth overlap,
quoting-vs-verification agreement, temporal stability, result sizing
- Tests at 5, 25, and 100 node scales
- Detailed metrics output for diagnosis
5. Findings document (docs/close_group_stability_findings.md)
- Before/after comparison of all metrics
- Root cause analysis confirming DHT population is the primary issue
- Geo-location confirmed NOT the culprit in test environment
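The confirmation protocol described in item 2 can be sketched as follows. This is a simplified stand-in, not the PR's implementation: peer ids are shortened to 4 bytes and `xor_distance` is reduced to a `u32`, but the shape (count appearances across lookups, filter by threshold, sort by XOR distance) matches the description above.

```rust
use std::collections::HashMap;

// Simplified XOR distance over 4-byte ids (the real code uses full XorNames).
fn xor_distance(a: &[u8; 4], b: &[u8; 4]) -> u32 {
    u32::from_be_bytes([a[0] ^ b[0], a[1] ^ b[1], a[2] ^ b[2], a[3] ^ b[3]])
}

/// Aggregate several independent lookup results: peers appearing in at least
/// `threshold` result sets form the confirmed close group, sorted by XOR
/// distance to `target` and trimmed to `k` members.
fn confirm_close_group(
    lookups: &[Vec<[u8; 4]>],
    target: &[u8; 4],
    threshold: usize,
    k: usize,
) -> Vec<[u8; 4]> {
    let mut counts: HashMap<[u8; 4], usize> = HashMap::new();
    for result in lookups {
        for peer in result {
            *counts.entry(*peer).or_insert(0) += 1;
        }
    }
    let mut confirmed: Vec<[u8; 4]> = counts
        .into_iter()
        .filter(|&(_, n)| n >= threshold)
        .map(|(peer, _)| peer)
        .collect();
    confirmed.sort_by_key(|p| xor_distance(p, target));
    confirmed.truncate(k);
    confirmed
}
```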
@claude @greptile review please
Add allow directives for clippy lints that are acceptable in test code (cast_precision_loss, too_many_lines, doc_markdown, uninlined_format_args, items_after_statements). Remove unused per_target_results and response_rate variables.
Pull request overview
This PR improves close-group stability to support merkle payment verification by proactively refreshing DHT routing tables, adding utilities to confirm close-group membership, enforcing (optional) close-group validation on PUT, and introducing new large-scale E2E diagnostics/tests.
Changes:
- Add periodic DHT routing-table refresh tasks in production nodes and the E2E testnet harness.
- Add `close_group` utilities (`confirm_close_group`, `is_node_in_close_group`) and wire P2P node access into `AntProtocol` for PUT-time validation.
- Add new E2E test modules for close-group stability and 100-node network convergence, plus a findings document.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| `src/node.rs` | Wraps P2P node in `Arc`, wires it into `AntProtocol`, and starts/stops a periodic DHT refresh task. |
| `src/storage/handler.rs` | Adds optional P2P node plumbing via `OnceLock` and enforces close-group membership before payment verification on PUT. |
| `src/close_group.rs` | New close-group confirmation / membership utilities using DHT lookups. |
| `src/lib.rs` | Exposes the new `close_group` module. |
| `tests/e2e/testnet.rs` | Wires P2P node into `AntProtocol` in the harness and adds a per-node DHT refresh background task. |
| `tests/e2e/close_group_stability.rs` | New E2E suite measuring routing completeness, ground-truth overlap, quoting-vs-verification agreement, and temporal stability at 5/25/100 nodes. |
| `tests/e2e/network_convergence.rs` | New 100-node DHT "lookup path convergence" test and warmup routines. |
| `tests/e2e/mod.rs` | Registers the new E2E test modules. |
| `docs/close_group_stability_findings.md` | Documents before/after metrics, root causes, and implemented mitigations. |
src/close_group.rs
Outdated
```rust
        // Check if we are closer than or equal to the furthest peer in the group
        if peers.len() < CLOSE_GROUP_SIZE {
            // Not enough peers found — we are likely in the close group
            return true;
        }

        // Check if we're closer than the furthest member
        let furthest_distance = peers
            .iter()
            .filter_map(|p| peer_id_to_xor_name(&p.peer_id.to_hex()))
            .map(|xor| xor_distance(target, &xor))
            .max();

        furthest_distance.map_or(true, |furthest| my_distance <= furthest)
    }
    _ => {
        // Lookup failed — can't determine, assume we're in to avoid
        // rejecting valid payments
        true
```
Returning true when peers.len() < CLOSE_GROUP_SIZE is very permissive and can cause handle_put to accept storage even when the node couldn’t retrieve a full close group (which may indicate incomplete routing rather than responsibility). Consider treating undersized results as “cannot confirm” and returning false/Err so the PUT can be rejected or retried.
```diff
-        // Check if we are closer than or equal to the furthest peer in the group
-        if peers.len() < CLOSE_GROUP_SIZE {
-            // Not enough peers found — we are likely in the close group
-            return true;
-        }
-        // Check if we're closer than the furthest member
-        let furthest_distance = peers
-            .iter()
-            .filter_map(|p| peer_id_to_xor_name(&p.peer_id.to_hex()))
-            .map(|xor| xor_distance(target, &xor))
-            .max();
-        furthest_distance.map_or(true, |furthest| my_distance <= furthest)
-    }
-    _ => {
-        // Lookup failed — can't determine, assume we're in to avoid
-        // rejecting valid payments
-        true
+        // We can only confirm responsibility if we obtained a full close group.
+        // Undersized results may indicate incomplete routing, so treat as
+        // "cannot confirm" rather than assuming responsibility.
+        if peers.len() < CLOSE_GROUP_SIZE {
+            return false;
+        }
+        // Check if we're closer than or equal to the furthest member
+        let furthest_distance = peers
+            .iter()
+            .filter_map(|p| peer_id_to_xor_name(&p.peer_id.to_hex()))
+            .map(|xor| xor_distance(target, &xor))
+            .max();
+        // If we cannot compute any distances, we cannot confirm responsibility.
+        furthest_distance.map_or(false, |furthest| my_distance <= furthest)
+    }
+    _ => {
+        // Lookup failed — can't determine, so do NOT confirm responsibility.
+        false
```
src/close_group.rs
Outdated
```rust
    _ => {
        // Lookup failed — can't determine, assume we're in to avoid
        // rejecting valid payments
        true
    }
```
On lookup failure/timeout this function returns true, which means close-group validation in PUT becomes a no-op whenever the DHT is unhealthy (exactly the scenario this is meant to guard). Consider returning false (or a Result so callers can return a retryable error) and logging at warn level to aid diagnosis.
```diff
-    _ => {
-        // Lookup failed — can't determine, assume we're in to avoid
-        // rejecting valid payments
-        true
-    }
+    Ok(Err(err)) => {
+        // Lookup failed — treat as not in close group and log for diagnosis.
+        warn!(
+            ?err,
+            "DHT close group lookup failed while checking node responsibility; \
+             treating node as not in close group"
+        );
+        false
+    }
+    Err(elapsed) => {
+        // Lookup timed out — treat as not in close group and log for diagnosis.
+        warn!(
+            ?elapsed,
+            "DHT close group lookup timed out after {:?} while checking node responsibility; \
+             treating node as not in close group",
+            CONFIRMATION_LOOKUP_TIMEOUT
+        );
+        false
+    }
```
```rust
for &obs_idx in observers {
    if let Some(p2p) = harness.node(obs_idx) {
        total_queried += 1;
        let timeout_dur = Duration::from_secs(30);
        let query = p2p.dht().find_closest_nodes(target, K);
        match tokio::time::timeout(timeout_dur, query).await {
            Ok(Ok(peers)) if !peers.is_empty() => {
```
These per-observer DHT lookups are executed sequentially, each with a 30s timeout. If the DHT is unhealthy, the test can spend a long time waiting on timeouts back-to-back and stall CI. Consider running observer lookups in parallel (with a concurrency cap) and/or adding an overall test timeout to bound failure-mode runtime.
```rust
if let Some(p2p) = harness.node(observer_idx) {
    match tokio::time::timeout(DHT_LOOKUP_TIMEOUT, p2p.dht().find_closest_nodes(target, k))
        .await
    {
        Ok(Ok(peers)) => peers.iter().map(|p| p.peer_id.to_hex()).collect(),
        _ => vec![],
```
dht_lookup uses a 30s timeout and is called in large nested loops (including 100-node tests). In failure modes this can make the suite take an extremely long time due to sequential timeouts. Consider reducing the timeout for these diagnostics, parallelizing lookups with a concurrency cap, and/or adding an overall per-test timeout.
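The parallelism with a concurrency cap suggested in these two comments can be sketched as below. This is a hedged illustration using std threads and fixed-size chunks rather than the project's tokio tasks; `lookup` stands in for a single timed DHT query, and `parallel_lookups` is a hypothetical helper name. Worst-case wall time becomes roughly `ceil(n / max_parallel)` timeouts instead of `n` sequential ones.

```rust
use std::thread;

// Hypothetical sketch: run `lookup` for each observer with at most
// `max_parallel` workers in flight at once. Each chunk is spawned and
// joined before the next chunk starts, bounding concurrency.
fn parallel_lookups<T, F>(observers: &[usize], max_parallel: usize, lookup: F) -> Vec<T>
where
    T: Send + 'static,
    F: Fn(usize) -> T + Copy + Send + 'static,
{
    let mut results = Vec::with_capacity(observers.len());
    for chunk in observers.chunks(max_parallel.max(1)) {
        // One worker per observer in this chunk; each gets its own copy of `lookup`.
        let handles: Vec<_> = chunk
            .iter()
            .map(|&idx| thread::spawn(move || lookup(idx)))
            .collect();
        for handle in handles {
            if let Ok(result) = handle.join() {
                results.push(result);
            }
        }
    }
    results
}
```

In the real async test harness the same shape would use `futures::stream::iter(...).buffer_unordered(cap)` or a semaphore around `tokio::time::timeout`.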
src/storage/handler.rs
Outdated
```rust
if !is_node_in_close_group(p2p, &address).await {
    debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
    return ChunkPutResponse::Error(ProtocolError::Internal(
        "This node is not in the close group for this address".to_string(),
    ));
```
Close-group rejection is returned as ProtocolError::Internal, but this is an expected routing/ownership condition rather than an internal failure. Consider adding a dedicated protocol error/response for “not responsible / not in close group” so clients can distinguish it from server errors and retry against other close-group members.
```rust
pub fn with_p2p_node(self, p2p_node: Arc<P2PNode>) -> Self {
    // OnceLock is empty after construction, so set() always succeeds here.
    let _ = self.p2p_node.set(p2p_node);
    self
```
OnceLock::set returns Err if the value was already set, but the result is ignored here. If with_p2p_node is ever called on an AntProtocol that already had p2p_node configured, this would silently do nothing and make debugging wiring issues hard. Consider handling the Err (e.g., debug_assert!, returning self unchanged with a warn!, or changing the API to make double-setting impossible).
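A minimal sketch of the suggested handling, with `String` standing in for `Arc<P2PNode>` and `eprintln!` for the project's `warn!` macro; `OnceLock::set` really does return `Err(value)` when already initialized, and the first value is kept:

```rust
use std::sync::OnceLock;

struct Protocol {
    p2p_node: OnceLock<String>, // String stands in for Arc<P2PNode>
}

impl Protocol {
    fn with_p2p_node(self, node: String) -> Self {
        // Surface double-initialization instead of silently ignoring it.
        if self.p2p_node.set(node).is_err() {
            eprintln!("with_p2p_node called twice; keeping the first value");
        }
        self
    }
}
```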
```rust
let step = if nodes.len() > actual_lookups {
    nodes.len() / actual_lookups
} else {
    1
};
```
This step computation can panic when actual_lookups == 0 (e.g., num_lookups == 0 or nodes is empty) due to division by zero. Add an early return when actual_lookups == 0 (and/or nodes.is_empty()) that returns an empty ConfirmedCloseGroup.
src/close_group.rs
Outdated
```rust
// Trim to close group size
confirmed.truncate(CLOSE_GROUP_SIZE);
```
confirm_close_group takes a k parameter, but the returned members is always truncated to CLOSE_GROUP_SIZE, so callers can’t get more than 5 confirmed peers even if they request a larger set. Either truncate to k (or k.min(CLOSE_GROUP_SIZE) if that’s the intent), or remove k from the public API and document the fixed-size behavior.
```diff
-// Trim to close group size
-confirmed.truncate(CLOSE_GROUP_SIZE);
+// Trim to the requested size, but never exceed the close group size
+confirmed.truncate(k.min(CLOSE_GROUP_SIZE));
```
- Add NotInCloseGroup protocol error variant instead of using Internal
- Add early return for empty nodes/lookups in confirm_close_group
- Fix truncation to use k parameter instead of hardcoded CLOSE_GROUP_SIZE
- Make is_node_in_close_group return false on undersized/failed lookups
- Log warning on double-set of P2P node via OnceLock
- Use is_some_and instead of map_or(false, ...) per clippy
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
```rust
    min_discovered >= total_nodes / 4,
    "Node with fewest DHT discoveries found only {min_discovered}/{total_nodes} — \
     routing tables are severely incomplete"
);

// Every node should have at least 3 direct connections
assert!(
    min_peers >= 2,
```
The assertion threshold doesn’t match the preceding comment: the comment says each node should discover at least 50% of peers via DHT, but the code asserts only total_nodes / 4 (25%). Please align the comment and the threshold so the test’s intent is unambiguous.
```diff
-    min_discovered >= total_nodes / 4,
-    "Node with fewest DHT discoveries found only {min_discovered}/{total_nodes} — \
-     routing tables are severely incomplete"
-);
-// Every node should have at least 3 direct connections
-assert!(
-    min_peers >= 2,
+    min_discovered >= total_nodes / 2,
+    "Node with fewest DHT discoveries found only {min_discovered}/{total_nodes} — \
+     routing tables are severely incomplete"
+);
+// Every node should have at least 3 direct connections
+assert!(
+    min_peers >= 3,
```
tests/e2e/close_group_stability.rs
Outdated
```rust
     routing tables are severely incomplete"
);

// Every node should have at least 3 direct connections
```
The comment says "Every node should have at least 3 direct connections", but the assertion allows min_peers >= 2. Please make the comment and the assertion consistent (either tighten the assert to 3 or update the comment).
```diff
-// Every node should have at least 3 direct connections
+// Every node should have at least 2 direct connections
```
src/close_group.rs
Outdated
```rust
//! Multiple nodes independently look up the same address and the intersection
//! of their results forms the "confirmed" close group.
```
The module docs say the confirmed close group is formed by the intersection of lookup results, but confirm_close_group() actually uses an appearance-count threshold (quorum/majority), not strict intersection. This is misleading for callers and future maintainers—please update the docs to describe the threshold-based confirmation correctly.
```diff
-//! Multiple nodes independently look up the same address and the intersection
-//! of their results forms the "confirmed" close group.
+//! Multiple nodes independently look up the same address and aggregate the
+//! returned close groups by counting how often each peer appears. Peers that
+//! meet or exceed a configured appearance threshold (e.g. quorum/majority)
+//! form the "confirmed" close group.
```
src/storage/handler.rs
Outdated
```rust
// 4. Close group verification — check if this node is responsible
// for the address. This prevents storing data we're not close to,
// which is critical for merkle payment verification consistency.
if let Some(p2p) = self.p2p_node.get() {
    if !is_node_in_close_group(p2p, &address).await {
        debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
        return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
    }
}
```
Close-group verification runs before payment verification. This makes unauthenticated/unpaid PUT attempts trigger an extra DHT lookup (amplifying DoS potential) and also changes the error precedence clients see (e.g., NotInCloseGroup instead of PaymentRequired/PaymentFailed). Consider moving the close-group check after you’ve determined the request is payable/paid, and ideally only enforcing it for Merkle proofs (SingleNode proofs likely shouldn’t require close-group membership).
- Replace direct indexing nodes[node_idx] with nodes.get() in confirm_close_group (project rule: no direct indexing)
- Reorder PUT handler: payment verification (O(1) cache check) now runs before close group check (DHT lookup with 15s timeout) to prevent unpaid requests from triggering expensive DHT lookups (DoS vector)
- Comment said 50% but assertion was 25% (total_nodes / 4) -- fixed comment
- Comment said 3 connections but assertion was >= 2 -- fixed comment
- Module doc said "intersection" but code uses quorum/threshold -- fixed doc
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
```markdown
### Fix 3: Close group validation in PUT handler (`src/storage/handler.rs`)

`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, before payment verification, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification.
```
The findings doc says the PUT handler performs the close-group check “before payment verification”, but src/storage/handler.rs now does the close-group lookup after verifying payment (to avoid unpaid DHT-lookup DoS). Please update this section to match the implemented order so the document doesn’t mislead future readers.
```diff
-`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, before payment verification, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification.
+`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, after successfully verifying payment, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification while avoiding unpaid DHT-lookup DoS vectors.
```
```rust
    PaymentFailed(String),
    /// Quote generation failed.
    QuoteFailed(String),
    /// This node is not in the close group for the requested address.
    /// The client should retry with a different close-group member.
    NotInCloseGroup,
    /// Internal error.
    Internal(String),
```
Adding ProtocolError::NotInCloseGroup changes the on-wire postcard/serde enum layout for ProtocolError. Older nodes/clients using the existing autonomi/ant/chunk/v1 protocol will fail to deserialize (or misinterpret) responses containing this new variant, and even the Internal variant’s index shifts. Please introduce explicit protocol versioning for this wire type (e.g., bump CHUNK_PROTOCOL_ID/PROTOCOL_VERSION and route accordingly, or redesign error encoding to be forward/backward compatible) before shipping this new error across the network.
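The hazard can be illustrated without the real codec. Postcard-style serde enum encoding writes a variant's ordinal index on the wire, so the sketch below (hand-rolled indices, not postcard itself) shows how inserting `NotInCloseGroup` shifts `Internal` and breaks an old decoder:

```rust
// v1 wire enum: variants encode as their ordinal position (0, 1, 2).
#[derive(Debug, PartialEq)]
enum ErrorV1 { PaymentFailed, QuoteFailed, Internal }

// v2 inserts NotInCloseGroup before Internal, shifting Internal's index.
#[derive(Debug, PartialEq)]
enum ErrorV2 { PaymentFailed, QuoteFailed, NotInCloseGroup, Internal }

fn encode_v2(e: &ErrorV2) -> u8 {
    match e {
        ErrorV2::PaymentFailed => 0,
        ErrorV2::QuoteFailed => 1,
        ErrorV2::NotInCloseGroup => 2,
        ErrorV2::Internal => 3,
    }
}

fn decode_v1(idx: u8) -> Option<ErrorV1> {
    match idx {
        0 => Some(ErrorV1::PaymentFailed),
        1 => Some(ErrorV1::QuoteFailed),
        2 => Some(ErrorV1::Internal),
        _ => None, // v1 has no variant at index 3: decode failure
    }
}
```

A v1 peer decoding a v2 `Internal` fails outright, and decodes a v2 `NotInCloseGroup` as `Internal`, which is exactly the cross-version breakage the comment warns about.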
```rust
if let Some(p2p) = self.p2p_node.get() {
    if !is_node_in_close_group(p2p, &address).await {
        debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
        return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
```
The new close-group enforcement will reject PUTs on nodes that aren’t in the close group for address. This is a behavioral breaking change for any caller that PUTs to an arbitrary node (including existing E2E helpers like TestNode::store_chunk/remote store tests that don’t first choose a close-group member). Either update the call paths/tests to first select the close group via DHT and target those nodes, or gate this check behind an explicit mode/config/payment-proof type so legacy PUT flows keep working.
```diff
-if let Some(p2p) = self.p2p_node.get() {
-    if !is_node_in_close_group(p2p, &address).await {
-        debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
-        return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
+//
+// To avoid breaking legacy PUT flows that send to arbitrary nodes,
+// we only enforce close-group checks when a payment proof is supplied.
+if request.payment_proof.is_some() {
+    if let Some(p2p) = self.p2p_node.get() {
+        if !is_node_in_close_group(p2p, &address).await {
+            debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
+            return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
+        }
```
```rust
    num_lookups: usize,
    threshold: usize,
) -> ConfirmedCloseGroup {
    let actual_lookups = num_lookups.min(nodes.len());

    if actual_lookups == 0 || nodes.is_empty() {
        return ConfirmedCloseGroup {
            members: Vec::new(),
            num_lookups: 0,
            num_responses: 0,
            threshold,
        };
```
confirm_close_group() accepts threshold without validation. If a caller passes threshold > actual_lookups, confirmation becomes impossible and the function will always return an empty members list with no clear signal that the parameters are invalid. Consider clamping threshold to actual_lookups (or returning an explicit empty result with a warning/error) so misconfiguration doesn’t silently disable confirmation.
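The suggested clamping can be sketched in one helper (the name `effective_threshold` is hypothetical): an unreachable threshold is pulled down to the number of lookups actually performed, and zero is bumped to one so confirmation never becomes a no-op.

```rust
/// Clamp a caller-supplied confirmation threshold into the achievable
/// range [1, actual_lookups], so misconfiguration can't silently make
/// confirmation impossible (threshold too high) or meaningless (zero).
fn effective_threshold(threshold: usize, actual_lookups: usize) -> usize {
    threshold.clamp(1, actual_lookups.max(1))
}
```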
During merkle payment verification, nodes now check that the candidate nodes in the winner pool are actually the closest nodes to the data address. This prevents malicious clients from building winner pools with arbitrary (non-closest) nodes. The verification performs a DHT lookup for the data address, derives peer IDs from each candidate ML-DSA public key, and checks that a majority appear in the close group. On DHT failure, the check is skipped with a warning for availability during network instability.
Addresses critical review findings:

1. Quote handlers (handle_quote, handle_merkle_candidate_quote) now verify the node is in the close group before issuing quotes. This prevents the quote-pay-refuse failure path where a client pays a quoted node that later rejects the PUT with NotInCloseGroup.
2. is_node_in_close_group now requests K+1 peers and filters out self before comparing distances. This fixes the off-by-one where the DHT may include the querying node in results, making the comparison against only K-1 external peers.
3. Test assertion comments aligned with actual thresholds (comments now describe the threshold as a catastrophic-failure detector, with the VERDICT log showing ideal-bar status).
Replace the single-lookup is_node_in_close_group with a multi-lookup threshold-based approach. The function now performs 3 independent DHT lookups (each following different iterative paths through the network) and requires 2/3 to agree that this node is in the close group. This directly addresses the core problem: with routing tables only covering 13-17% of the network, a single lookup can follow a path that misses closer nodes. Multiple lookups traverse different parts of the routing graph and are far more likely to discover the true closest peers. The same consensus mechanism is now used in the live PUT path, quote handlers, and merkle candidate verification — not just in tests.
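The 3-lookup / 2-agree mechanism can be sketched as follows; `lookup` stands in for one iterative DHT query that reports whether this node appeared in that query's close group, and the function name is a simplified stand-in for the PR's implementation:

```rust
/// Run `lookup` up to `num_lookups` times and confirm membership once
/// `required` lookups agree. Returns early as soon as consensus is reached,
/// mirroring the 3-lookup / 2-agree scheme described above.
fn is_in_close_group_consensus<F: FnMut(usize) -> bool>(
    num_lookups: usize,
    required: usize,
    mut lookup: F,
) -> bool {
    let mut positive = 0;
    for i in 0..num_lookups {
        if lookup(i) {
            positive += 1;
            if positive >= required {
                return true; // early exit once consensus is reached
            }
        }
    }
    false
}
```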
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
src/close_group.rs
Outdated
```rust
// If we couldn't retrieve enough external peers, we can't confirm
// responsibility — treat as "not in close group" so the PUT is
// rejected or retried rather than silently accepted.
if external_peers.len() < CLOSE_GROUP_SIZE {
    warn!(
        "is_node_in_close_group: only found {} external peers (need {CLOSE_GROUP_SIZE})",
        external_peers.len()
    );
    return false;
}
```
is_node_in_close_group returns false whenever fewer than CLOSE_GROUP_SIZE external peers are returned. In small networks (e.g. 5 nodes) find_closest_nodes can only return 4 external peers, so this will always mark the node as not-in-close-group and cause all Quote/PUT requests to be rejected with NotInCloseGroup once the P2P node is wired in. Consider handling small-network cases (e.g. if node.peer_count().await + 1 <= CLOSE_GROUP_SIZE, treat as in close group), or relaxing the minimum-external-peers requirement so close-group checks don’t brick early deployments.
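The suggested small-network escape hatch is a one-line predicate. In this sketch the names are hypothetical and `peer_count` stands in for the value of `node.peer_count().await`: when self plus all known peers fit inside one close group, every node is responsible for every address by definition.

```rust
const CLOSE_GROUP_SIZE: usize = 5;

/// In a network no larger than the close group, every node is in the
/// close group for every address, so membership checks should pass
/// without requiring CLOSE_GROUP_SIZE external lookup results.
fn small_network_always_in_group(peer_count: usize) -> bool {
    peer_count + 1 <= CLOSE_GROUP_SIZE
}
```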
```rust
        // Empty results — can't verify, log and allow
        warn!(
            "Cannot verify merkle candidates for {}: DHT returned empty results",
            hex::encode(xorname)
        );
        return Ok(());
    }
    Ok(Err(e)) => {
        warn!(
            "Cannot verify merkle candidates for {}: DHT error: {e}",
            hex::encode(xorname)
        );
        return Ok(());
    }
    Err(_) => {
        warn!(
            "Cannot verify merkle candidates for {}: DHT lookup timed out",
            hex::encode(xorname)
        );
        return Ok(());
```
verify_candidates_are_closest currently returns Ok(()) (allowing verification to pass) when the DHT lookup is empty/errors/times out. Since this check is explicitly intended to prevent malicious clients from submitting non-close-group candidate pools, failing open here allows an attacker to bypass the protection by inducing DHT failures/timeouts (e.g. via load) at the verifier. Consider failing closed (returning an error so the client retries), or gating this behavior behind a config flag with a clear default for security vs availability.
```diff
-        // Empty results — can't verify, log and allow
-        warn!(
-            "Cannot verify merkle candidates for {}: DHT returned empty results",
-            hex::encode(xorname)
-        );
-        return Ok(());
-    }
-    Ok(Err(e)) => {
-        warn!(
-            "Cannot verify merkle candidates for {}: DHT error: {e}",
-            hex::encode(xorname)
-        );
-        return Ok(());
-    }
-    Err(_) => {
-        warn!(
-            "Cannot verify merkle candidates for {}: DHT lookup timed out",
-            hex::encode(xorname)
-        );
-        return Ok(());
+        // Empty results — can't verify, fail closed so caller can retry
+        warn!(
+            "Cannot verify merkle candidates for {}: DHT returned empty results",
+            hex::encode(xorname)
+        );
+        return Err(Error::Payment(format!(
+            "Unable to verify merkle candidates for {}: DHT returned empty results",
+            hex::encode(xorname)
+        )));
+    }
+    Ok(Err(e)) => {
+        warn!(
+            "Cannot verify merkle candidates for {}: DHT error: {e}",
+            hex::encode(xorname)
+        );
+        return Err(Error::Payment(format!(
+            "Unable to verify merkle candidates for {}: DHT error: {e}",
+            hex::encode(xorname)
+        )));
+    }
+    Err(_) => {
+        warn!(
+            "Cannot verify merkle candidates for {}: DHT lookup timed out",
+            hex::encode(xorname)
+        );
+        return Err(Error::Payment(format!(
+            "Unable to verify merkle candidates for {}: DHT lookup timed out",
+            hex::encode(xorname)
+        )));
```
```markdown
### Fix 3: Close group validation in PUT handler (`src/storage/handler.rs`)

`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, before payment verification, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification.

Wired in production (`src/node.rs`) via `with_p2p_node()` and in tests (`tests/e2e/testnet.rs`) via `set_p2p_node()`.
```
The findings doc says the PUT handler calls is_node_in_close_group() before payment verification, but the implementation in src/storage/handler.rs performs the close-group check after payment verification (to avoid DHT-lookup DoS). Please update this text to match the actual behavior so the documentation doesn’t mislead readers.
Documents the root cause (routing table not seeded from transport connections) and proposes 4 fixes in saorsa-core:

1. Seed DHT routing table from transport connections (critical)
2. Periodic k-bucket refresh (important)
3. Expose routing table contents via API (nice-to-have)
4. Local-only find_closest_nodes (important for performance)

Includes measured data, expected impact, validation steps, and references to the original Kademlia paper.
The devnet path (src/devnet.rs) was missing the set_p2p_node call, meaning close group verification was silently skipped for all devnet nodes. Now all three node creation paths (production, devnet, test) wire the P2P node into the protocol handler and payment verifier.
Local-only find_closest_nodes would return inaccurate results with routing tables at 13-17% discovery -- the exact problem we are trying to fix. Only viable after routing table seeding is implemented.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.
```rust
// Request K+1 peers because the DHT may include this node in results.
let lookup_k = CLOSE_GROUP_SIZE + 1;
```
is_node_in_close_group requests CLOSE_GROUP_SIZE + 1 peers and then computes the max distance across all external_peers. If the DHT result does not include this node, external_peers can contain K+1 peers, making the “furthest” distance correspond to the (K+1)th closest and causing false positives (a node just outside the close group may be accepted). Consider trimming to exactly CLOSE_GROUP_SIZE external peers (by XOR distance) before comparing, or otherwise ensure you’re comparing against the Kth-closest boundary excluding self.
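The suggested trim can be sketched with distances reduced to plain integers (the real code compares XorName distances). Keeping only the K closest external peers before taking the maximum makes the comparison use the true Kth-closest boundary:

```rust
/// After excluding self from lookup results, keep only the K closest
/// external peers and return the Kth-closest distance. With K+1 results
/// this drops the (K+1)th peer so the boundary is not inflated.
fn kth_closest_boundary(mut external_distances: Vec<u32>, k: usize) -> Option<u32> {
    external_distances.sort_unstable();
    external_distances.truncate(k); // discard anything past the Kth peer
    external_distances.last().copied()
}

/// A node is in the close group if its distance to the target is within
/// the Kth-closest boundary; no boundary means we cannot confirm.
fn in_close_group(my_distance: u32, external_distances: Vec<u32>, k: usize) -> bool {
    kth_closest_boundary(external_distances, k)
        .is_some_and(|furthest| my_distance <= furthest)
}
```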
```rust
// Require at least a majority of candidates to be in the close group.
// We use CLOSE_GROUP_SIZE as the bar since the winner pool may have
// more candidates than the close group (e.g., 16 candidates for K=5).
let min_required = if candidates.len() >= CLOSE_GROUP_SIZE {
    // At least CLOSE_GROUP_MAJORITY of the close group nodes
    // should appear among the candidates
    (CLOSE_GROUP_SIZE / 2) + 1
} else {
    // Small pool — require majority of what's there
    (candidates.len() / 2) + 1
};

if candidates_in_close_group < min_required {
    return Err(Error::Payment(format!(
        "Merkle candidate pool for {} has only {candidates_in_close_group} nodes \
         in the close group (need {min_required}). Candidates may not be the \
         actual closest nodes to this address.",
        hex::encode(xorname)
```
The doc/comment says “majority of the candidates” must be in the close group, but the computed min_required is based on CLOSE_GROUP_SIZE when candidates.len() >= CLOSE_GROUP_SIZE (e.g., only 3 required for a 16-candidate pool). Either tighten the threshold to match the stated intent (majority of candidates / all paid recipients), or update the docs/comments/error message so the security model is unambiguous.
```diff
-// Require at least a majority of candidates to be in the close group.
-// We use CLOSE_GROUP_SIZE as the bar since the winner pool may have
-// more candidates than the close group (e.g., 16 candidates for K=5).
-let min_required = if candidates.len() >= CLOSE_GROUP_SIZE {
-    // At least CLOSE_GROUP_MAJORITY of the close group nodes
-    // should appear among the candidates
-    (CLOSE_GROUP_SIZE / 2) + 1
-} else {
-    // Small pool — require majority of what's there
-    (candidates.len() / 2) + 1
-};
-if candidates_in_close_group < min_required {
-    return Err(Error::Payment(format!(
-        "Merkle candidate pool for {} has only {candidates_in_close_group} nodes \
-         in the close group (need {min_required}). Candidates may not be the \
-         actual closest nodes to this address.",
-        hex::encode(xorname)
+// Require that a majority of the relevant nodes are in the close group.
+// When there are at least CLOSE_GROUP_SIZE candidates, we require a
+// majority of the close group itself to appear among the candidates.
+// When there are fewer candidates than CLOSE_GROUP_SIZE, we require a
+// majority of the available candidates.
+let min_required = if candidates.len() >= CLOSE_GROUP_SIZE {
+    // At least a simple majority of the close group nodes should appear
+    // among the candidates.
+    (CLOSE_GROUP_SIZE / 2) + 1
+} else {
+    // Small pool — require majority of what's there.
+    (candidates.len() / 2) + 1
+};
+if candidates_in_close_group < min_required {
+    return Err(Error::Payment(format!(
+        "Merkle candidate pool for {} has only {candidates_in_close_group} nodes \
+         in the close group (need at least {min_required} based on close group \
+         size {close_group_size} and {candidates_len} candidates). Candidates \
+         may not be the actual closest nodes to this address.",
+        hex::encode(xorname),
+        close_group_size = CLOSE_GROUP_SIZE,
+        candidates_len = candidates.len(),
```
```rust
// Verify this node is in the close group for the address before
// issuing a quote. This prevents clients from paying nodes that
// will later refuse the PUT with NotInCloseGroup.
if let Some(p2p) = self.p2p_node.get() {
    if !is_node_in_close_group(p2p, &request.address).await {
```
handle_quote now performs close-group verification via is_node_in_close_group, which triggers multiple DHT lookups per unauthenticated/unpaid quote request (and can block up to the lookup timeout). This increases quote latency and creates an easy DHT-amplification/DoS vector. Consider adding caching (per address and/or per peer), rate limiting, a cheaper/local check for the quote path, and/or shorter/parallelized lookups specifically for quote handling.
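One possible shape for the caching mitigation, sketched with std-only types. The struct name, TTL, and 32-byte address key are illustrative choices, not the crate's API:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Memoize the close-group membership verdict per address for a short TTL,
/// so repeated quote requests for the same address do not each trigger a
/// fresh round of DHT lookups.
struct CloseGroupCache {
    ttl: Duration,
    entries: HashMap<[u8; 32], (bool, Instant)>,
}

impl CloseGroupCache {
    fn new(ttl: Duration) -> Self {
        Self { ttl, entries: HashMap::new() }
    }

    /// Return the cached verdict if it is still fresh; otherwise run the
    /// (expensive) computation and cache its result.
    fn get_or_insert_with(&mut self, addr: [u8; 32], compute: impl FnOnce() -> bool) -> bool {
        let now = Instant::now();
        if let Some(&(verdict, at)) = self.entries.get(&addr) {
            if now.duration_since(at) < self.ttl {
                return verdict; // fresh entry: skip the DHT lookups entirely
            }
        }
        let verdict = compute(); // e.g. the expensive is_node_in_close_group call
        self.entries.insert(addr, (verdict, now));
        verdict
    }
}
```

A stale-but-bounded answer is acceptable here because close-group membership changes only as fast as churn; a TTL on the order of seconds already collapses a burst of quote requests into a single lookup round.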
```rust
// Verify this node is in the close group for the address before
// issuing a merkle candidate quote.
if let Some(p2p) = self.p2p_node.get() {
    if !is_node_in_close_group(p2p, &request.address).await {
        debug!("Refusing merkle candidate quote for {addr_hex}: not in close group");
```
handle_merkle_candidate_quote also runs is_node_in_close_group (multiple DHT lookups) before issuing a candidate quote. Since these requests are likewise unauthenticated/unpaid, this can be abused to generate sustained DHT traffic and add significant latency to the quoting flow. Consider applying the same mitigations as for handle_quote (caching/rate limiting/cheaper check/shorter timeouts).
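The rate-limiting half of the mitigation could be as simple as a fixed-window counter per requesting peer; names and limits below are illustrative, not part of the crate:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Cap how many quote-path close-group checks a single peer can trigger
/// per time window, refusing the expensive DHT work once the budget is
/// exhausted. The peer is keyed by a string id here for simplicity.
struct QuoteRateLimiter {
    max_per_window: u32,
    window: Duration,
    counters: HashMap<String, (u32, Instant)>,
}

impl QuoteRateLimiter {
    fn new(max_per_window: u32, window: Duration) -> Self {
        Self { max_per_window, window, counters: HashMap::new() }
    }

    /// Returns true if the peer is still within its budget for this window.
    fn allow(&mut self, peer_id: &str) -> bool {
        let now = Instant::now();
        let entry = self.counters.entry(peer_id.to_string()).or_insert((0, now));
        if now.duration_since(entry.1) >= self.window {
            *entry = (0, now); // window expired: reset the counter
        }
        if entry.0 < self.max_per_window {
            entry.0 += 1;
            true
        } else {
            false // over budget: refuse the expensive close-group check
        }
    }
}
```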
Closing for now, @mickvandijke feel free to re-open
Summary
Key Findings
Geo-location confirmed NOT the culprit -- problem persists with all diversity checks disabled. The core bottleneck is in saorsa-core's DHT routing table population.
Changes
- DHT routing table refresh background task (src/node.rs, tests/e2e/testnet.rs) -- periodic random lookups every 30s (prod) / 10s (tests)
- Close group confirmation protocol (src/close_group.rs) -- confirm_close_group() and is_node_in_close_group() utilities
- Close group validation in PUT handler (src/storage/handler.rs) -- AntProtocol checks close group membership before accepting storage
- Comprehensive test suite (tests/e2e/close_group_stability.rs) -- 8 tests at 5/25/100 node scales
- Findings document (docs/close_group_stability_findings.md) -- full before/after analysis

Test plan
- cargo build passes
- cargo clippy --all-features passes (no errors)
- cargo test --test e2e --features test-utils close_group_stability -- --nocapture
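The refresh background task described in the changes above (periodic random lookups to keep routing tables populated) can be sketched roughly as follows; `lookup` stands in for the real DHT query, the interval mirrors the PR's 30s production / 10s test settings, and everything else here is illustrative:

```rust
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Generate a pseudo-random 32-byte lookup target. The real code would
/// use a proper CSPRNG; a clock-seeded LCG is enough for a sketch.
fn random_target() -> [u8; 32] {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .subsec_nanos();
    let mut state = nanos.wrapping_mul(2_654_435_761);
    let mut target = [0u8; 32];
    for b in &mut target {
        state = state.wrapping_mul(1_664_525).wrapping_add(1_013_904_223); // LCG step
        *b = (state >> 24) as u8;
    }
    target
}

/// Spawn a background loop that periodically issues a lookup for a random
/// target, so Kademlia buckets across the keyspace keep getting refreshed.
fn spawn_refresh_task(
    interval: Duration,
    lookup: impl Fn([u8; 32]) + Send + 'static,
) -> thread::JoinHandle<()> {
    thread::spawn(move || loop {
        lookup(random_target()); // each lookup repopulates buckets along its path
        thread::sleep(interval);
    })
}
```

In the PR itself this runs inside the node's async runtime rather than a raw thread, but the shape is the same: a timer loop driving lookups toward uniformly spread random targets.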