Skip to content

feat: close group stability for merkle payment verification#45

Closed
grumbach wants to merge 12 commits intomainfrom
feat/close-group-stability
Closed

feat: close group stability for merkle payment verification#45
grumbach wants to merge 12 commits intomainfrom
feat/close-group-stability

Conversation

@grumbach
Copy link
Copy Markdown
Collaborator

Summary

  • Root cause identified: DHT routing tables are severely underpopulated -- nodes discover only ~13% of the network via DHT despite being connected to ~90% via transport. This causes close group disagreements at scale which would break merkle payment verification.
  • Three fixes implemented: DHT routing table refresh background task, close group confirmation protocol, and close group validation in PUT handler.
  • Comprehensive test suite: 8 tests covering routing table completeness, ground truth overlap, quoting-vs-verification agreement, and temporal stability at 5/25/100 node scales.

Key Findings

Metric Before After
25-node quoting-verification Jaccard 0.983 1.000
25-node exact matches 19/20 20/20
100-node routing table discovery 13.3% 16.7%
100-node min ground truth overlap 0.520 0.660
100-node targets with majority overlap 19/20 20/20

Geo-location confirmed NOT the culprit -- problem persists with all diversity checks disabled. The core bottleneck is in saorsa-core's DHT routing table population.

Changes

  1. DHT routing table refresh (src/node.rs, tests/e2e/testnet.rs) -- periodic random lookups every 30s (prod) / 10s (tests)
  2. Close group confirmation protocol (src/close_group.rs) -- confirm_close_group() and is_node_in_close_group() utilities
  3. Close group validation in PUT (src/storage/handler.rs) -- AntProtocol checks close group membership before accepting storage
  4. Test suite (tests/e2e/close_group_stability.rs) -- 8 tests at 5/25/100 node scales
  5. Findings doc (docs/close_group_stability_findings.md) -- full before/after analysis

Test plan

  • cargo build passes
  • cargo clippy --all-features passes (no errors)
  • 5-node close group stability: perfect 1.000
  • 25-node quoting vs verification: perfect 1.000 (20/20 exact matches)
  • 25-node ground truth overlap: 0.992+
  • 100-node ground truth overlap: 0.877 (all 20/20 targets with majority)
  • 100-node routing table discovery improved 13.3% to 16.7%
  • Temporal stability: 1.000 (no drift)
  • Run full test suite: cargo test --test e2e --features test-utils close_group_stability -- --nocapture

The DHT routing tables are severely underpopulated — nodes discover only
~13% of the network via DHT despite being connected to ~90% via transport.
This causes close group disagreements at scale which would break merkle
payment verification (client quoting and node verification return different
closest-node sets).

Changes:

1. DHT routing table refresh background task (src/node.rs, tests/e2e/testnet.rs)
   - Periodic random lookups every 30s (production) / 10s (tests) to keep
     routing tables populated
   - Improved routing table discovery from 13.3% to 16.7% at 100 nodes
   - At 25 nodes: quoting-vs-verification Jaccard improved from 0.983 to 1.000

2. Close group confirmation protocol (src/close_group.rs)
   - confirm_close_group(): multi-node consensus on close group membership
   - is_node_in_close_group(): self-check for close group responsibility
   - Sorted by XOR distance, threshold-based filtering

3. Close group validation in PUT handler (src/storage/handler.rs)
   - AntProtocol checks if node is in close group before accepting PUT
   - Uses OnceLock<Arc<P2PNode>> for deferred initialization
   - Prevents storing data this node is not responsible for

4. Comprehensive test suite (tests/e2e/close_group_stability.rs)
   - 8 tests: routing table completeness, ground truth overlap,
     quoting-vs-verification agreement, temporal stability, result sizing
   - Tests at 5, 25, and 100 node scales
   - Detailed metrics output for diagnosis

5. Findings document (docs/close_group_stability_findings.md)
   - Before/after comparison of all metrics
   - Root cause analysis confirming DHT population is the primary issue
   - Geo-location confirmed NOT the culprit in test environment
Copilot AI review requested due to automatic review settings March 26, 2026 08:17
@grumbach
Copy link
Copy Markdown
Collaborator Author

@claude @greptile review please

Add allow directives for clippy lints that are acceptable in test code
(cast_precision_loss, too_many_lines, doc_markdown, uninlined_format_args,
items_after_statements). Remove unused per_target_results and response_rate
variables.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves close-group stability to support merkle payment verification by proactively refreshing DHT routing tables, adding utilities to confirm close-group membership, enforcing (optional) close-group validation on PUT, and introducing new large-scale E2E diagnostics/tests.

Changes:

  • Add periodic DHT routing-table refresh tasks in production nodes and the E2E testnet harness.
  • Add close_group utilities (confirm_close_group, is_node_in_close_group) and wire P2P node access into AntProtocol for PUT-time validation.
  • Add new E2E test modules for close-group stability and 100-node network convergence, plus a findings document.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/node.rs Wraps P2P node in Arc, wires it into AntProtocol, and starts/stops a periodic DHT refresh task.
src/storage/handler.rs Adds optional P2P node plumbing via OnceLock and enforces close-group membership before payment verification on PUT.
src/close_group.rs New close-group confirmation / membership utilities using DHT lookups.
src/lib.rs Exposes the new close_group module.
tests/e2e/testnet.rs Wires P2P node into AntProtocol in the harness and adds a per-node DHT refresh background task.
tests/e2e/close_group_stability.rs New E2E suite measuring routing completeness, ground-truth overlap, quoting-vs-verification agreement, and temporal stability at 5/25/100 nodes.
tests/e2e/network_convergence.rs New 100-node DHT “lookup path convergence” test and warmup routines.
tests/e2e/mod.rs Registers the new E2E test modules.
docs/close_group_stability_findings.md Documents before/after metrics, root causes, and implemented mitigations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +181 to +199
// Check if we are closer than or equal to the furthest peer in the group
if peers.len() < CLOSE_GROUP_SIZE {
// Not enough peers found — we are likely in the close group
return true;
}

// Check if we're closer than the furthest member
let furthest_distance = peers
.iter()
.filter_map(|p| peer_id_to_xor_name(&p.peer_id.to_hex()))
.map(|xor| xor_distance(target, &xor))
.max();

furthest_distance.map_or(true, |furthest| my_distance <= furthest)
}
_ => {
// Lookup failed — can't determine, assume we're in to avoid
// rejecting valid payments
true
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returning true when peers.len() < CLOSE_GROUP_SIZE is very permissive and can cause handle_put to accept storage even when the node couldn’t retrieve a full close group (which may indicate incomplete routing rather than responsibility). Consider treating undersized results as “cannot confirm” and returning false/Err so the PUT can be rejected or retried.

Suggested change
// Check if we are closer than or equal to the furthest peer in the group
if peers.len() < CLOSE_GROUP_SIZE {
// Not enough peers found — we are likely in the close group
return true;
}
// Check if we're closer than the furthest member
let furthest_distance = peers
.iter()
.filter_map(|p| peer_id_to_xor_name(&p.peer_id.to_hex()))
.map(|xor| xor_distance(target, &xor))
.max();
furthest_distance.map_or(true, |furthest| my_distance <= furthest)
}
_ => {
// Lookup failed — can't determine, assume we're in to avoid
// rejecting valid payments
true
// We can only confirm responsibility if we obtained a full close group.
// Undersized results may indicate incomplete routing, so treat as
// "cannot confirm" rather than assuming responsibility.
if peers.len() < CLOSE_GROUP_SIZE {
return false;
}
// Check if we're closer than or equal to the furthest member
let furthest_distance = peers
.iter()
.filter_map(|p| peer_id_to_xor_name(&p.peer_id.to_hex()))
.map(|xor| xor_distance(target, &xor))
.max();
// If we cannot compute any distances, we cannot confirm responsibility.
furthest_distance.map_or(false, |furthest| my_distance <= furthest)
}
_ => {
// Lookup failed — can't determine, so do NOT confirm responsibility.
false

Copilot uses AI. Check for mistakes.
Comment on lines +196 to +200
_ => {
// Lookup failed — can't determine, assume we're in to avoid
// rejecting valid payments
true
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On lookup failure/timeout this function returns true, which means close-group validation in PUT becomes a no-op whenever the DHT is unhealthy (exactly the scenario this is meant to guard). Consider returning false (or a Result so callers can return a retryable error) and logging at warn level to aid diagnosis.

Suggested change
_ => {
// Lookup failed — can't determine, assume we're in to avoid
// rejecting valid payments
true
}
Ok(Err(err)) => {
// Lookup failed — treat as not in close group and log for diagnosis.
warn!(
?err,
"DHT close group lookup failed while checking node responsibility; \
treating node as not in close group"
);
false
}
Err(elapsed) => {
// Lookup timed out — treat as not in close group and log for diagnosis.
warn!(
?elapsed,
"DHT close group lookup timed out after {:?} while checking node responsibility; \
treating node as not in close group",
CONFIRMATION_LOOKUP_TIMEOUT
);
false
}

Copilot uses AI. Check for mistakes.
Comment on lines +101 to +107
for &obs_idx in observers {
if let Some(p2p) = harness.node(obs_idx) {
total_queried += 1;
let timeout_dur = Duration::from_secs(30);
let query = p2p.dht().find_closest_nodes(target, K);
match tokio::time::timeout(timeout_dur, query).await {
Ok(Ok(peers)) if !peers.is_empty() => {
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These per-observer DHT lookups are executed sequentially, each with a 30s timeout. If the DHT is unhealthy, the test can spend a long time waiting on timeouts back-to-back and stall CI. Consider running observer lookups in parallel (with a concurrency cap) and/or adding an overall test timeout to bound failure-mode runtime.

Copilot uses AI. Check for mistakes.
Comment on lines +149 to +154
if let Some(p2p) = harness.node(observer_idx) {
match tokio::time::timeout(DHT_LOOKUP_TIMEOUT, p2p.dht().find_closest_nodes(target, k))
.await
{
Ok(Ok(peers)) => peers.iter().map(|p| p.peer_id.to_hex()).collect(),
_ => vec![],
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dht_lookup uses a 30s timeout and is called in large nested loops (including 100-node tests). In failure modes this can make the suite take an extremely long time due to sequential timeouts. Consider reducing the timeout for these diagnostics, parallelizing lookups with a concurrency cap, and/or adding an overall per-test timeout.

Copilot uses AI. Check for mistakes.
Comment on lines +210 to +214
if !is_node_in_close_group(p2p, &address).await {
debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
return ChunkPutResponse::Error(ProtocolError::Internal(
"This node is not in the close group for this address".to_string(),
));
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close-group rejection is returned as ProtocolError::Internal, but this is an expected routing/ownership condition rather than an internal failure. Consider adding a dedicated protocol error/response for “not responsible / not in close group” so clients can distinguish it from server errors and retry against other close-group members.

Copilot uses AI. Check for mistakes.
Comment on lines +96 to +99
pub fn with_p2p_node(self, p2p_node: Arc<P2PNode>) -> Self {
// OnceLock is empty after construction, so set() always succeeds here.
let _ = self.p2p_node.set(p2p_node);
self
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OnceLock::set returns Err if the value was already set, but the result is ignored here. If with_p2p_node is ever called on an AntProtocol that already had p2p_node configured, this would silently do nothing and make debugging wiring issues hard. Consider handling the Err (e.g., debug_assert!, returning self unchanged with a warn!, or changing the API to make double-setting impossible).

Copilot uses AI. Check for mistakes.
Comment on lines +97 to +101
let step = if nodes.len() > actual_lookups {
nodes.len() / actual_lookups
} else {
1
};
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This step computation can panic when actual_lookups == 0 (e.g., num_lookups == 0 or nodes is empty) due to division by zero. Add an early return when actual_lookups == 0 (and/or nodes.is_empty()) that returns an empty ConfirmedCloseGroup.

Copilot uses AI. Check for mistakes.
Comment on lines +151 to +152
// Trim to close group size
confirmed.truncate(CLOSE_GROUP_SIZE);
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confirm_close_group takes a k parameter, but the returned members is always truncated to CLOSE_GROUP_SIZE, so callers can’t get more than 5 confirmed peers even if they request a larger set. Either truncate to k (or k.min(CLOSE_GROUP_SIZE) if that’s the intent), or remove k from the public API and document the fixed-size behavior.

Suggested change
// Trim to close group size
confirmed.truncate(CLOSE_GROUP_SIZE);
// Trim to the requested size, but never exceed the close group size
confirmed.truncate(k.min(CLOSE_GROUP_SIZE));

Copilot uses AI. Check for mistakes.
- Add NotInCloseGroup protocol error variant instead of using Internal
- Add early return for empty nodes/lookups in confirm_close_group
- Fix truncation to use k parameter instead of hardcoded CLOSE_GROUP_SIZE
- Make is_node_in_close_group return false on undersized/failed lookups
- Log warning on double-set of P2P node via OnceLock
- Use is_some_and instead of map_or(false, ...) per clippy
Copilot AI review requested due to automatic review settings March 26, 2026 08:25
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +305 to +312
min_discovered >= total_nodes / 4,
"Node with fewest DHT discoveries found only {min_discovered}/{total_nodes} — \
routing tables are severely incomplete"
);

// Every node should have at least 3 direct connections
assert!(
min_peers >= 2,
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assertion threshold doesn’t match the preceding comment: the comment says each node should discover at least 50% of peers via DHT, but the code asserts only total_nodes / 4 (25%). Please align the comment and the threshold so the test’s intent is unambiguous.

Suggested change
min_discovered >= total_nodes / 4,
"Node with fewest DHT discoveries found only {min_discovered}/{total_nodes} — \
routing tables are severely incomplete"
);
// Every node should have at least 3 direct connections
assert!(
min_peers >= 2,
min_discovered >= total_nodes / 2,
"Node with fewest DHT discoveries found only {min_discovered}/{total_nodes} — \
routing tables are severely incomplete"
);
// Every node should have at least 3 direct connections
assert!(
min_peers >= 3,

Copilot uses AI. Check for mistakes.
routing tables are severely incomplete"
);

// Every node should have at least 3 direct connections
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says "Every node should have at least 3 direct connections", but the assertion allows min_peers >= 2. Please make the comment and the assertion consistent (either tighten the assert to 3 or update the comment).

Suggested change
// Every node should have at least 3 direct connections
// Every node should have at least 2 direct connections

Copilot uses AI. Check for mistakes.
Comment on lines +4 to +5
//! Multiple nodes independently look up the same address and the intersection
//! of their results forms the "confirmed" close group.
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module docs say the confirmed close group is formed by the intersection of lookup results, but confirm_close_group() actually uses an appearance-count threshold (quorum/majority), not strict intersection. This is misleading for callers and future maintainers—please update the docs to describe the threshold-based confirmation correctly.

Suggested change
//! Multiple nodes independently look up the same address and the intersection
//! of their results forms the "confirmed" close group.
//! Multiple nodes independently look up the same address and aggregate the
//! returned close groups by counting how often each peer appears. Peers that
//! meet or exceed a configured appearance threshold (e.g. quorum/majority)
//! form the "confirmed" close group.

Copilot uses AI. Check for mistakes.
Comment on lines +209 to +217
// 4. Close group verification — check if this node is responsible
// for the address. This prevents storing data we're not close to,
// which is critical for merkle payment verification consistency.
if let Some(p2p) = self.p2p_node.get() {
if !is_node_in_close_group(p2p, &address).await {
debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
}
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close-group verification runs before payment verification. This makes unauthenticated/unpaid PUT attempts trigger an extra DHT lookup (amplifying DoS potential) and also changes the error precedence clients see (e.g., NotInCloseGroup instead of PaymentRequired/PaymentFailed). Consider moving the close-group check after you’ve determined the request is payable/paid, and ideally only enforcing it for Merkle proofs (SingleNode proofs likely shouldn’t require close-group membership).

Copilot uses AI. Check for mistakes.
- Replace direct indexing nodes[node_idx] with nodes.get() in
  confirm_close_group (project rule: no direct indexing)
- Reorder PUT handler: payment verification (O(1) cache check) now runs
  before close group check (DHT lookup with 15s timeout) to prevent
  unpaid requests from triggering expensive DHT lookups (DoS vector)
- Comment said 50% but assertion was 25% (total_nodes / 4) -- fixed comment
- Comment said 3 connections but assertion was >= 2 -- fixed comment
- Module doc said "intersection" but code uses quorum/threshold -- fixed doc
Copilot AI review requested due to automatic review settings March 26, 2026 08:35
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.


### Fix 3: Close group validation in PUT handler (`src/storage/handler.rs`)

`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, before payment verification, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification.
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The findings doc says the PUT handler performs the close-group check “before payment verification”, but src/storage/handler.rs now does the close-group lookup after verifying payment (to avoid unpaid DHT-lookup DoS). Please update this section to match the implemented order so the document doesn’t mislead future readers.

Suggested change
`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, before payment verification, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification.
`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, after successfully verifying payment, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification while avoiding unpaid DHT-lookup DoS vectors.

Copilot uses AI. Check for mistakes.
Comment on lines 322 to 329
PaymentFailed(String),
/// Quote generation failed.
QuoteFailed(String),
/// This node is not in the close group for the requested address.
/// The client should retry with a different close-group member.
NotInCloseGroup,
/// Internal error.
Internal(String),
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding ProtocolError::NotInCloseGroup changes the on-wire postcard/serde enum layout for ProtocolError. Older nodes/clients using the existing autonomi/ant/chunk/v1 protocol will fail to deserialize (or misinterpret) responses containing this new variant, and even the Internal variant’s index shifts. Please introduce explicit protocol versioning for this wire type (e.g., bump CHUNK_PROTOCOL_ID/PROTOCOL_VERSION and route accordingly, or redesign error encoding to be forward/backward compatible) before shipping this new error across the network.

Copilot uses AI. Check for mistakes.
Comment on lines +234 to +237
if let Some(p2p) = self.p2p_node.get() {
if !is_node_in_close_group(p2p, &address).await {
debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new close-group enforcement will reject PUTs on nodes that aren’t in the close group for address. This is a behavioral breaking change for any caller that PUTs to an arbitrary node (including existing E2E helpers like TestNode::store_chunk/remote store tests that don’t first choose a close-group member). Either update the call paths/tests to first select the close group via DHT and target those nodes, or gate this check behind an explicit mode/config/payment-proof type so legacy PUT flows keep working.

Suggested change
if let Some(p2p) = self.p2p_node.get() {
if !is_node_in_close_group(p2p, &address).await {
debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
//
// To avoid breaking legacy PUT flows that send to arbitrary nodes,
// we only enforce close-group checks when a payment proof is supplied.
if request.payment_proof.is_some() {
if let Some(p2p) = self.p2p_node.get() {
if !is_node_in_close_group(p2p, &address).await {
debug!("Rejecting PUT for {addr_hex}: this node is not in the close group");
return ChunkPutResponse::Error(ProtocolError::NotInCloseGroup);
}

Copilot uses AI. Check for mistakes.
Comment on lines +90 to +101
num_lookups: usize,
threshold: usize,
) -> ConfirmedCloseGroup {
let actual_lookups = num_lookups.min(nodes.len());

if actual_lookups == 0 || nodes.is_empty() {
return ConfirmedCloseGroup {
members: Vec::new(),
num_lookups: 0,
num_responses: 0,
threshold,
};
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confirm_close_group() accepts threshold without validation. If a caller passes threshold > actual_lookups, confirmation becomes impossible and the function will always return an empty members list with no clear signal that the parameters are invalid. Consider clamping threshold to actual_lookups (or returning an explicit empty result with a warning/error) so misconfiguration doesn’t silently disable confirmation.

Copilot uses AI. Check for mistakes.
During merkle payment verification, nodes now check that the candidate
nodes in the winner pool are actually the closest nodes to the data
address. This prevents malicious clients from building winner pools
with arbitrary (non-closest) nodes.

The verification performs a DHT lookup for the data address, derives
peer IDs from each candidate ML-DSA public key, and checks that a
majority appear in the close group. On DHT failure, the check is
skipped with a warning for availability during network instability.
Addresses critical review findings:

1. Quote handlers (handle_quote, handle_merkle_candidate_quote) now
   verify the node is in the close group before issuing quotes. This
   prevents the quote-pay-refuse failure path where a client pays
   a quoted node that later rejects the PUT with NotInCloseGroup.

2. is_node_in_close_group now requests K+1 peers and filters out
   self before comparing distances. This fixes the off-by-one where
   the DHT may include the querying node in results, making the
   comparison against only K-1 external peers.

3. Test assertion comments aligned with actual thresholds (comments
   now describe the threshold as a catastrophic-failure detector,
   with the VERDICT log showing ideal-bar status).
Copilot AI review requested due to automatic review settings March 26, 2026 08:48
Replace the single-lookup is_node_in_close_group with a multi-lookup
threshold-based approach. The function now performs 3 independent DHT
lookups (each following different iterative paths through the network)
and requires 2/3 to agree that this node is in the close group.

This directly addresses the core problem: with routing tables only
covering 13-17% of the network, a single lookup can follow a path
that misses closer nodes. Multiple lookups traverse different parts
of the routing graph and are far more likely to discover the true
closest peers.

The same consensus mechanism is now used in the live PUT path, quote
handlers, and merkle candidate verification — not just in tests.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +204 to +213
// If we couldn't retrieve enough external peers, we can't confirm
// responsibility — treat as "not in close group" so the PUT is
// rejected or retried rather than silently accepted.
if external_peers.len() < CLOSE_GROUP_SIZE {
warn!(
"is_node_in_close_group: only found {} external peers (need {CLOSE_GROUP_SIZE})",
external_peers.len()
);
return false;
}
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_node_in_close_group returns false whenever fewer than CLOSE_GROUP_SIZE external peers are returned. In small networks (e.g. 5 nodes) find_closest_nodes can only return 4 external peers, so this will always mark the node as not-in-close-group and cause all Quote/PUT requests to be rejected with NotInCloseGroup once the P2P node is wired in. Consider handling small-network cases (e.g. if node.peer_count().await + 1 <= CLOSE_GROUP_SIZE, treat as in close group), or relaxing the minimum-external-peers requirement so close-group checks don’t brick early deployments.

Copilot uses AI. Check for mistakes.
Comment on lines +736 to +755
// Empty results — can't verify, log and allow
warn!(
"Cannot verify merkle candidates for {}: DHT returned empty results",
hex::encode(xorname)
);
return Ok(());
}
Ok(Err(e)) => {
warn!(
"Cannot verify merkle candidates for {}: DHT error: {e}",
hex::encode(xorname)
);
return Ok(());
}
Err(_) => {
warn!(
"Cannot verify merkle candidates for {}: DHT lookup timed out",
hex::encode(xorname)
);
return Ok(());
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

verify_candidates_are_closest currently returns Ok(()) (allowing verification to pass) when the DHT lookup is empty/errors/times out. Since this check is explicitly intended to prevent malicious clients from submitting non-close-group candidate pools, failing open here allows an attacker to bypass the protection by inducing DHT failures/timeouts (e.g. via load) at the verifier. Consider failing closed (returning an error so the client retries), or gating this behavior behind a config flag with a clear default for security vs availability.

Suggested change
// Empty results — can't verify, log and allow
warn!(
"Cannot verify merkle candidates for {}: DHT returned empty results",
hex::encode(xorname)
);
return Ok(());
}
Ok(Err(e)) => {
warn!(
"Cannot verify merkle candidates for {}: DHT error: {e}",
hex::encode(xorname)
);
return Ok(());
}
Err(_) => {
warn!(
"Cannot verify merkle candidates for {}: DHT lookup timed out",
hex::encode(xorname)
);
return Ok(());
// Empty results — can't verify, fail closed so caller can retry
warn!(
"Cannot verify merkle candidates for {}: DHT returned empty results",
hex::encode(xorname)
);
return Err(Error::Payment(format!(
"Unable to verify merkle candidates for {}: DHT returned empty results",
hex::encode(xorname)
)));
}
Ok(Err(e)) => {
warn!(
"Cannot verify merkle candidates for {}: DHT error: {e}",
hex::encode(xorname)
);
return Err(Error::Payment(format!(
"Unable to verify merkle candidates for {}: DHT error: {e}",
hex::encode(xorname)
)));
}
Err(_) => {
warn!(
"Cannot verify merkle candidates for {}: DHT lookup timed out",
hex::encode(xorname)
);
return Err(Error::Payment(format!(
"Unable to verify merkle candidates for {}: DHT lookup timed out",
hex::encode(xorname)
)));

Copilot uses AI. Check for mistakes.
Comment on lines +76 to +80
### Fix 3: Close group validation in PUT handler (`src/storage/handler.rs`)

`AntProtocol` now optionally holds a reference to the `P2PNode` (via `OnceLock` for deferred initialization). During PUT requests, before payment verification, the handler calls `is_node_in_close_group()` to verify the node is actually responsible for the address. This prevents nodes from storing data they're not close to, which is critical for merkle payment verification.

Wired in production (`src/node.rs`) via `with_p2p_node()` and in tests (`tests/e2e/testnet.rs`) via `set_p2p_node()`.
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The findings doc says the PUT handler calls is_node_in_close_group() before payment verification, but the implementation in src/storage/handler.rs performs the close-group check after payment verification (to avoid DHT-lookup DoS). Please update this text to match the actual behavior so the documentation doesn’t mislead readers.

Copilot uses AI. Check for mistakes.
Documents the root cause (routing table not seeded from transport
connections) and proposes 4 fixes in saorsa-core:

1. Seed DHT routing table from transport connections (critical)
2. Periodic k-bucket refresh (important)
3. Expose routing table contents via API (nice-to-have)
4. Local-only find_closest_nodes (important for performance)

Includes measured data, expected impact, validation steps, and
references to the original Kademlia paper.
Copilot AI review requested due to automatic review settings March 26, 2026 08:55
The devnet path (src/devnet.rs) was missing the set_p2p_node call,
meaning close group verification was silently skipped for all devnet
nodes. Now all three node creation paths (production, devnet, test)
wire the P2P node into the protocol handler and payment verifier.
Local-only find_closest_nodes would return inaccurate results with
routing tables at 13-17% discovery -- the exact problem we are
trying to fix. Only viable after routing table seeding is implemented.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +203 to +205
// Request K+1 peers because the DHT may include this node in results.
let lookup_k = CLOSE_GROUP_SIZE + 1;

Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_node_in_close_group requests CLOSE_GROUP_SIZE + 1 peers and then computes the max distance across all external_peers. If the DHT result does not include this node, external_peers can contain K+1 peers, making the “furthest” distance correspond to the (K+1)th closest and causing false positives (a node just outside the close group may be accepted). Consider trimming to exactly CLOSE_GROUP_SIZE external peers (by XOR distance) before comparing, or otherwise ensure you’re comparing against the Kth-closest boundary excluding self.

Copilot uses AI. Check for mistakes.
Comment on lines +776 to +793
// Require at least a majority of candidates to be in the close group.
// We use CLOSE_GROUP_SIZE as the bar since the winner pool may have
// more candidates than the close group (e.g., 16 candidates for K=5).
let min_required = if candidates.len() >= CLOSE_GROUP_SIZE {
// At least CLOSE_GROUP_MAJORITY of the close group nodes
// should appear among the candidates
(CLOSE_GROUP_SIZE / 2) + 1
} else {
// Small pool — require majority of what's there
(candidates.len() / 2) + 1
};

if candidates_in_close_group < min_required {
return Err(Error::Payment(format!(
"Merkle candidate pool for {} has only {candidates_in_close_group} nodes \
in the close group (need {min_required}). Candidates may not be the \
actual closest nodes to this address.",
hex::encode(xorname)
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The doc/comment says “majority of the candidates” must be in the close group, but the computed min_required is based on CLOSE_GROUP_SIZE when candidates.len() >= CLOSE_GROUP_SIZE (e.g., only 3 required for a 16-candidate pool). Either tighten the threshold to match the stated intent (majority of candidates / all paid recipients), or update the docs/comments/error message so the security model is unambiguous.

Suggested change
// Require at least a majority of candidates to be in the close group.
// We use CLOSE_GROUP_SIZE as the bar since the winner pool may have
// more candidates than the close group (e.g., 16 candidates for K=5).
let min_required = if candidates.len() >= CLOSE_GROUP_SIZE {
// At least CLOSE_GROUP_MAJORITY of the close group nodes
// should appear among the candidates
(CLOSE_GROUP_SIZE / 2) + 1
} else {
// Small pool — require majority of what's there
(candidates.len() / 2) + 1
};
if candidates_in_close_group < min_required {
return Err(Error::Payment(format!(
"Merkle candidate pool for {} has only {candidates_in_close_group} nodes \
in the close group (need {min_required}). Candidates may not be the \
actual closest nodes to this address.",
hex::encode(xorname)
// Require that a majority of the relevant nodes are in the close group.
// When there are at least CLOSE_GROUP_SIZE candidates, we require a
// majority of the close group itself to appear among the candidates.
// When there are fewer candidates than CLOSE_GROUP_SIZE, we require a
// majority of the available candidates.
let min_required = if candidates.len() >= CLOSE_GROUP_SIZE {
// At least a simple majority of the close group nodes should appear
// among the candidates.
(CLOSE_GROUP_SIZE / 2) + 1
} else {
// Small pool — require majority of what's there.
(candidates.len() / 2) + 1
};
if candidates_in_close_group < min_required {
return Err(Error::Payment(format!(
"Merkle candidate pool for {} has only {candidates_in_close_group} nodes \
in the close group (need at least {min_required} based on close group \
size {close_group_size} and {candidates_len} candidates). Candidates \
may not be the actual closest nodes to this address.",
hex::encode(xorname),
close_group_size = CLOSE_GROUP_SIZE,
candidates_len = candidates.len(),

Copilot uses AI. Check for mistakes.
Comment on lines +293 to +297
// Verify this node is in the close group for the address before
// issuing a quote. This prevents clients from paying nodes that
// will later refuse the PUT with NotInCloseGroup.
if let Some(p2p) = self.p2p_node.get() {
if !is_node_in_close_group(p2p, &request.address).await {
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle_quote now performs close-group verification via is_node_in_close_group, which triggers multiple DHT lookups per unauthenticated/unpaid quote request (and can block up to the lookup timeout). This increases quote latency and creates an easy DHT-amplification/DoS vector. Consider adding caching (per address and/or per peer), rate limiting, a cheaper/local check for the quote path, and/or shorter/parallelized lookups specifically for quote handling.

Copilot uses AI. Check for mistakes.
Comment on lines +363 to +367
// Verify this node is in the close group for the address before
// issuing a merkle candidate quote.
if let Some(p2p) = self.p2p_node.get() {
if !is_node_in_close_group(p2p, &request.address).await {
debug!("Refusing merkle candidate quote for {addr_hex}: not in close group");
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle_merkle_candidate_quote also runs is_node_in_close_group (multiple DHT lookups) before issuing a candidate quote. Since these requests are likewise unauthenticated/unpaid, this can be abused to generate sustained DHT traffic and add significant latency to the quoting flow. Consider applying the same mitigations as for handle_quote (caching/rate limiting/cheaper check/shorter timeouts).

Copilot uses AI. Check for mistakes.
@grumbach
Copy link
Copy Markdown
Collaborator Author

Closing for now, @mickvandijke feel free to re-open

@grumbach grumbach closed this Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants