Add passive neighbor-liveness inference with probe confirmation by attermann · Pull Request #64 · attermann/microReticulum

attermann · 2026-06-24T21:02:59Z

Problem

A transport node currently has no way to detect when a direct (hops==0)
neighbor is suffering asymmetric RF connectivity. The neighbor can still
transmit, so its announces continue to flow, path-table entries refresh
in the inbound handler, and age-based expiry never fires. But its
receive path silently drops the data we forward. Packets routed through
it disappear without trace.

The failure mode is worst for fire-and-forget PROVE_NONE traffic and
for paths used by quiet destinations that do not emit return traffic,
because nothing in the existing stack notices that forwarded packets
are never being acknowledged. Sustained blackhole forwarding continues
until the announce horizon eventually runs out, which in some modes
takes hours.

Solution

A local-only, passive inference layer on the transport side:

- Track per-direct-neighbor counters: packets forwarded through this
  next-hop, and proofs that subsequently returned along the matching
  reverse-table route.
- On each maintenance tick, evaluate suspicion: if a neighbor has had
  meaningful forwarding within the window but no proof returned, it is
  a candidate. Per-neighbor rate limits keep probe traffic bounded.
- For each candidate, send one packet to the neighbor's existing
  probe_destination (already PROVE_ALL and advertised via mgmt
  destinations when the peer runs with probe_destination_enabled). The
  PacketReceipt's delivery callback confirms reciprocal reachability;
  the timeout callback marks every path going through this neighbor as
  unresponsive, letting the existing announce-replacement logic swap in
  a working route on the next fresh announce.

No wire-format changes. No new packet types. No HELLO protocol. No
changes to receiver-side proof_strategy or PROVE_* semantics. The
whole mechanism rides on existing primitives: probe_destination,
reverse_table, mark_path_responsive, mark_path_unresponsive, and
PacketReceipt callbacks.

Changes

PacketReceipt acquires a std::function-based callback variant so the
neighbor-probe outcome handlers can capture neighbor_hash by value. The
plain function-pointer setters remain for source compatibility with
existing firmware and are marked deprecated. As an incidental fix,
PacketReceipt::check_timeout previously set status to FAILED but left
the dispatch site as a commented-out thread stub from the Python
reference; the timeout callback never actually fired. The callback is
now dispatched synchronously. The new feature depends on this, and any
existing caller that relies on timeout callbacks benefits as well.
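
The dispatch order described above can be sketched roughly as follows. This is a minimal stand-in, not the actual PacketReceipt class: `ReceiptSketch`, its setters, and the use of `double` timestamps are assumptions made for illustration.

```cpp
#include <functional>
#include <utility>

// Hypothetical stand-in for PacketReceipt; all names here are illustrative.
class ReceiptSketch {
public:
    enum Status { SENT, DELIVERED, FAILED };
    using LegacyCallback = void (*)(const ReceiptSketch&);
    using Handler = std::function<void(const ReceiptSketch&)>;

    ReceiptSketch(double sent_at, double timeout)
        : _sent_at(sent_at), _timeout(timeout) {}

    // Legacy setter: a plain function pointer, which cannot capture state.
    void set_timeout_callback(LegacyCallback cb) { _legacy_timeout = cb; }
    // Newer setter: accepts capture-bearing callables such as lambdas.
    void set_timeout_handler(Handler h) { _timeout_handler = std::move(h); }

    Status status() const { return _status; }

    // Marks the receipt FAILED once the timeout has elapsed and dispatches
    // synchronously, preferring the std::function handler when one is set.
    void check_timeout(double now) {
        if (_status != SENT || now < _sent_at + _timeout) return;
        _status = FAILED;
        if (_timeout_handler) {
            _timeout_handler(*this);
        } else if (_legacy_timeout) {
            _legacy_timeout(*this);
        }
    }

private:
    double _sent_at;
    double _timeout;
    Status _status = SENT;
    LegacyCallback _legacy_timeout = nullptr;
    Handler _timeout_handler;
};
```

Because the status leaves SENT on the first expiry, repeated calls to `check_timeout` do not fire the handler again.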

ReverseEntry gains a _next_hop field so a returning proof can be
attributed to the neighbor that forwarded the original packet. The
construction site in Transport::inbound populates it from the next-hop
already in scope.

Transport gains a NeighborStat struct (packets forwarded, proofs
received, timestamps, probe-pending state, pending probe hash) and an
in-memory NeighborStatsTable keyed by neighbor hash. Counters are
window-relative and reset on successful probe completion or after
extended idle. Five tunables (suspicion window, min-packet threshold,
probe rate-limit, probe timeout, probe payload size) sit alongside the
existing Transport timing constants.
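
As a rough picture of the data involved, the struct and table might look like the sketch below. The field names follow the description above, but the exact member names, the constant names, and every constant value except the 16-byte payload size are assumptions; `std::string` stands in for the project's hash type.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Tunables. Only the 16-byte payload size is stated in this PR; the other
// values are placeholders chosen for illustration.
constexpr double        NEIGHBOR_SUSPICION_WINDOW = 300.0;  // seconds (assumed)
constexpr std::uint32_t NEIGHBOR_MIN_PACKETS      = 5;      // packets (assumed)
constexpr double        NEIGHBOR_PROBE_RATE_LIMIT = 600.0;  // seconds (assumed)
constexpr double        NEIGHBOR_PROBE_TIMEOUT    = 30.0;   // seconds (assumed)
constexpr std::size_t   NEIGHBOR_PROBE_SIZE       = 16;     // bytes

// Per-neighbor counters and probe state, kept in memory only.
struct NeighborStat {
    std::uint32_t packets_forwarded = 0;
    std::uint32_t proofs_received   = 0;
    double last_packet_at = 0.0;
    double last_proof_at  = 0.0;
    double last_probe_at  = 0.0;
    bool   probe_pending  = false;
    std::string pending_probe_hash;
};

// Keyed by neighbor hash (std::string is a stand-in for the real hash type).
using NeighborStatsTable = std::map<std::string, NeighborStat>;

// Clears the window-relative counters, as done after a successful probe
// or after extended idle.
void reset_window(NeighborStat& stat) {
    stat.packets_forwarded = 0;
    stat.proofs_received = 0;
}
```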

Hooks: outbound() increments packets_forwarded after a successful
transmit in all three forwarding branches; the proof-consumption block
increments proofs_received when a returning proof is transported back
through the reverse_table. jobs() runs a scan each tick that walks
neighbor_stats through five gates (sufficient activity, recent
forwarding, no recent proof, no probe in flight, rate limit ok),
snapshots candidates, then dispatches probes. Outcome handlers reset
counters on delivery and demote paths on timeout.
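
The five gates and the snapshot step could be approximated as below. This is a sketch under assumptions: `ScanStat`, `ScanLimits`, and both function names are hypothetical, and the exact comparison used for each gate in the real code is not stated in this PR.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Minimal stand-in for the per-neighbor stats entry (hypothetical names).
struct ScanStat {
    std::uint32_t packets_forwarded;
    std::uint32_t proofs_received;
    double last_packet_at;
    double last_proof_at;
    double last_probe_at;   // 0 means "never probed"
    bool   probe_pending;
};

struct ScanLimits {
    double window;            // suspicion window, seconds
    std::uint32_t min_packets;
    double probe_rate_limit;  // minimum seconds between probes per neighbor
};

// Returns true only when an entry passes all five gates.
bool is_probe_candidate(const ScanStat& s, double now, const ScanLimits& lim) {
    if (s.packets_forwarded < lim.min_packets) return false;    // 1. sufficient activity
    if (now - s.last_packet_at > lim.window) return false;      // 2. recent forwarding
    if (s.proofs_received > 0 &&
        now - s.last_proof_at <= lim.window) return false;      // 3. no recent proof
    if (s.probe_pending) return false;                          // 4. no probe in flight
    if (s.last_probe_at > 0 &&
        now - s.last_probe_at < lim.probe_rate_limit) return false;  // 5. rate limit ok
    return true;
}

// Collects candidate keys first so that probes can be dispatched afterwards
// without iterating the table while it may be modified.
std::vector<std::string> snapshot_candidates(
        const std::map<std::string, ScanStat>& table,
        double now, const ScanLimits& lim) {
    std::vector<std::string> out;
    for (const auto& kv : table) {
        if (is_probe_candidate(kv.second, now, lim)) out.push_back(kv.first);
    }
    return out;
}
```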

Reticulum gains two programmatic toggles with accessor pairs:
neighbor_probing_enabled (default on) gates the whole feature, and
neighbor_probing_path_request_fallback_enabled (default off) optionally
issues a path request when a suspect neighbor's probe destination is
not yet in the path table.

Lifecycle: remove_path and the legacy _path_table cull both erase the
matching neighbor_stats entry when a path is dropped, and the scan
itself resets accumulated counters for neighbors idle past twice the
suspicion window.

All new code is guarded by RNS_NEIGHBOR_PROBING (default on; set
-DRNS_NEIGHBOR_PROBING=0 in build_flags to compile out) and prefixed
with DIVERGENCE comments noting how each piece relates to the Python
reference. The check_timeout dispatch fix is unguarded because it is a
latent-bug fix, not a divergence.
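
A default-on compile-time guard of this kind is commonly written as in the sketch below. Only the macro name and the `-DRNS_NEIGHBOR_PROBING=0` override come from this PR; `ForwarderSketch` and `neighbor_probing_compiled_in` are hypothetical names used to show the pattern.

```cpp
// Default the feature on unless the build overrides it, for example with
// -DRNS_NEIGHBOR_PROBING=0 in build_flags.
#ifndef RNS_NEIGHBOR_PROBING
#define RNS_NEIGHBOR_PROBING 1
#endif

// Hypothetical forwarding component with one guarded counter.
struct ForwarderSketch {
    int forwarded = 0;
#if RNS_NEIGHBOR_PROBING
    // DIVERGENCE: per-neighbor accounting does not exist in the reference.
    int neighbor_packets = 0;
#endif

    void on_forward() {
        ++forwarded;
#if RNS_NEIGHBOR_PROBING
        ++neighbor_packets;
#endif
    }
};

// Lets callers and tests check whether the feature was compiled in.
constexpr bool neighbor_probing_compiled_in() {
    return RNS_NEIGHBOR_PROBING != 0;
}
```

Using `#if` rather than `#ifdef` matters here: a build that defines the macro as 0 must still compile the feature out.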

…eout dispatch

Introduce RNS_NEIGHBOR_PROBING feature flag (default on) for upcoming passive neighbor-liveness inference work. Under this flag, PacketReceipt gains std::function-based delivery/timeout handler setters that accept capture-bearing callables (e.g. lambdas closing over local state); the existing plain function-pointer setters remain for source compatibility with out-of-tree firmware and are marked deprecated.

PacketReceipt::check_timeout() previously set status to FAILED but left callback dispatch as a //z thread stub from the Python reference, so the registered timeout callback was never invoked. Wire up synchronous dispatch so the timeout-callback contract actually fires. The dispatcher prefers the std::function handler when set, otherwise falls back to the legacy function-pointer callback.

Add test_receipt_timeout_handler_capture covering both the capture-bearing handler path and the timeout dispatch fix. All 171 existing tests continue to pass on native17.

Add an _next_hop field to ReverseEntry so that when a proof comes back along a forwarded route, the inbound proof-consumption code can identify which direct neighbor forwarded the original packet. This is a building block for passive neighbor-liveness inference: counting proofs returned per neighbor lets us spot asymmetric RF connectivity where a neighbor keeps transmitting but silently drops everything inbound.

The Python reference plan extends its reverse_table list with an IDX_RT_NEXT_HOP slot; the C++ port uses named members on the existing ReverseEntry class instead. Construction site in Transport::inbound is updated to pass next_hop, which is already in scope from the forwarding calculation a few lines above.

Gated on RNS_NEIGHBOR_PROBING; the no-feature build keeps the original three-argument ReverseEntry constructor. All 171 native17 tests pass.

Wire up the data structures and configuration surface for passive neighbor-liveness inference:

- Five NEIGHBOR_* tunables alongside the existing Transport timing constants in Type.h (suspicion window, min-packet threshold, probe rate-limit, probe timeout, probe payload size).
- NeighborStat struct + NeighborStatsTable using-alias in Transport.h, with a static _neighbor_stats member for the per-neighbor counters. In-memory only (ephemeral state; not microStore-backed). Uses ContainerAllocator so long-lived entries can live in the configured container memory pool.
- Two static-bool Reticulum toggles with accessor pairs: neighbor_probing_enabled (default true) and neighbor_probing_path_request_fallback_enabled (default false). The Python reference plan parses these from the reticulum INI block; microReticulum has no INI parser, so they are exposed as static accessors only.

All additions guarded by RNS_NEIGHBOR_PROBING with DIVERGENCE comments. No new behavior yet; hooks land in subsequent commits. All 171 native17 tests pass.

Add _record_neighbor_packet helper and call it from the three transmit sites in Transport::outbound: the multi-hop forwarding branch, the hops==1 shared-instance forwarding branch, and the direct-delivery branch. The helper increments packets_forwarded and stamps last_packet_at on the neighbor's NeighborStat entry, creating it on demand.

The next-hop attribution uses destination_entry._received_from in all three branches. For multi-hop paths this is the transport node we are handing the packet off to; for hops==0 (directly reachable) it is the destination itself, which by definition is the immediate neighbor.

Only counts when transmit returned true so failed sends do not bias the stats. Empty next-hop (broadcast, unknown reverse entry) is silently ignored.

Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.

Add _record_neighbor_proof helper and call it from the proof-forwarding block in inbound packet handling. When a returning proof matches a reverse_table entry and gets transported back along the original route, attribute the proof to the neighbor recorded in reverse_entry._next_hop.

Together with the outbound packets_forwarded counter, this gives each direct neighbor a ratio of proofs returned vs packets forwarded over the current window. A sustained low ratio is the signal that drives later suspicion + targeted probe dispatch.

Only increments the counter when an entry already exists in _neighbor_stats; proofs for neighbors we never counted outbound through are ignored.

Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.
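
The two recording helpers and the reverse-table attribution can be sketched together as follows. This is an illustration under assumptions, not the project's code: the struct and function names are hypothetical, `std::string` stands in for the hash type, and the real ReverseEntry carries more fields than the single `next_hop` shown.

```cpp
#include <cstdint>
#include <map>
#include <string>

struct HopCounters {
    std::uint32_t packets_forwarded = 0;
    std::uint32_t proofs_received   = 0;
    double last_packet_at = 0.0;
    double last_proof_at  = 0.0;
};

// Only the field relevant to attribution is modelled here.
struct ReverseEntrySketch {
    std::string next_hop;
};

using HopStatsTable = std::map<std::string, HopCounters>;
using ReverseTable  = std::map<std::string, ReverseEntrySketch>;  // keyed by packet hash

// Outbound side: counts only successful transmits with a known next hop,
// creating the stats entry on demand.
void record_neighbor_packet(HopStatsTable& stats, const std::string& next_hop,
                            bool transmitted, double now) {
    if (!transmitted || next_hop.empty()) return;
    HopCounters& s = stats[next_hop];
    ++s.packets_forwarded;
    s.last_packet_at = now;
}

// Inbound side: never creates an entry, so proofs for neighbors we never
// counted outbound traffic through are ignored. Returns true if counted.
bool record_neighbor_proof(HopStatsTable& stats, const std::string& next_hop,
                           double now) {
    HopStatsTable::iterator it = stats.find(next_hop);
    if (it == stats.end()) return false;
    ++it->second.proofs_received;
    it->second.last_proof_at = now;
    return true;
}

// Looks up the reverse entry for a returning proof and attributes the proof
// to the neighbor that forwarded the original packet.
bool attribute_proof(HopStatsTable& stats, const ReverseTable& reverse,
                     const std::string& packet_hash, double now) {
    ReverseTable::const_iterator it = reverse.find(packet_hash);
    if (it == reverse.end()) return false;
    return record_neighbor_proof(stats, it->second.next_hop, now);
}
```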

Wire the passive neighbor-liveness logic into jobs(). Each tick we walk _neighbor_stats and gate each entry through five filters: sufficient activity, recent forwarding within the suspicion window, no recent proof return, no probe already in flight, and per-neighbor probe rate limit. Surviving entries get a single probe dispatched to the neighbor's built-in probe destination.

The probe path:

- Identity::recall lifts the neighbor's identity out of the announce store. If the identity is not yet known, log and skip.
- Destination::hash derives the probe-destination hash. If we have no path to it, log and skip; if the path-request fallback is enabled, optionally issue a path request and let a later tick retry.
- Otherwise construct an OUT/SINGLE destination, send a 16-byte random payload as a Packet, attach delivery and timeout handlers that capture the neighbor hash by value, and stamp probe_pending / last_probe_at on the stats entry.

Delivery handler clears probe_pending, resets the window counters, and walks the path table calling mark_path_responsive on every entry whose next-hop is this neighbor.

Timeout handler clears probe_pending and calls mark_path_unresponsive on every such entry, letting existing announce-replacement logic swap in a working route on the next fresh announce. Counter reset on timeout is deliberately not done so the next cycle starts from the suspect state.

Scan builds a snapshot list before dispatching to avoid iterator invalidation if a dispatch triggers any synchronous transport activity that touches the stats map.

Gated on RNS_NEIGHBOR_PROBING and on the runtime triple (transport_enabled, neighbor_probing_enabled, probe_destination_enabled). All 171 native17 tests pass.
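
The asymmetry between the two outcome handlers is easy to miss, so here is a small sketch of it. Everything in it is a stand-in: `ProbeOutcomeSketch`, its members, and the handler factories are hypothetical, and the real code goes through mark_path_responsive / mark_path_unresponsive rather than writing a state field directly.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <string>

enum class PathState { UNKNOWN, RESPONSIVE, UNRESPONSIVE };

struct PathSketch {
    std::string next_hop;
    PathState state;
};

struct ProbeStat {
    std::uint32_t packets_forwarded;
    std::uint32_t proofs_received;
    bool probe_pending;
};

struct ProbeOutcomeSketch {
    std::map<std::string, PathSketch> paths;  // keyed by destination hash
    std::map<std::string, ProbeStat> stats;   // keyed by neighbor hash

    void mark_paths_via(const std::string& neighbor, PathState state) {
        for (auto& kv : paths) {
            if (kv.second.next_hop == neighbor) kv.second.state = state;
        }
    }

    // The neighbor hash is captured by value, so the handler stays valid
    // after the scan's local variables have gone out of scope.
    std::function<void()> make_delivery_handler(std::string neighbor) {
        return [this, neighbor]() {
            ProbeStat& s = stats[neighbor];
            s.probe_pending = false;
            s.packets_forwarded = 0;  // window counters reset on success
            s.proofs_received = 0;
            mark_paths_via(neighbor, PathState::RESPONSIVE);
        };
    }

    std::function<void()> make_timeout_handler(std::string neighbor) {
        return [this, neighbor]() {
            // Counters are deliberately left alone so that the next cycle
            // starts from the suspect state.
            stats[neighbor].probe_pending = false;
            mark_paths_via(neighbor, PathState::UNRESPONSIVE);
        };
    }
};
```

The handlers capture `this`, so in this sketch the owning object must outlive any receipt that still holds one of them.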

Round out the neighbor-stats lifecycle:

- _scan_neighbor_stats now resets packets_forwarded and proofs_received to zero when a neighbor has been idle past twice the suspicion window. Without this, a brief burst of forwarding followed by long quiet would leave stale counters that could spuriously trigger suspicion the next time traffic resumed.
- remove_path and the legacy _path_table cull both erase the matching _neighbor_stats entry when a destination's path record is dropped. For hops==0 paths the destination hash equals the neighbor hash so the stats entry is the one keyed by the same value; for hops greater than zero the erase is a no-op.

Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.
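
Both lifecycle rules are small enough to show in a few lines. The sketch below assumes hypothetical names (`IdleStat`, `reset_idle_counters`, `on_path_removed`) and a `std::string` key in place of the real hash type.

```cpp
#include <cstdint>
#include <map>
#include <string>

struct IdleStat {
    std::uint32_t packets_forwarded;
    std::uint32_t proofs_received;
    double last_packet_at;
};

using IdleStatsTable = std::map<std::string, IdleStat>;

// Zeroes the counters of any neighbor idle for more than twice the window,
// so an old burst cannot trigger suspicion when traffic resumes.
void reset_idle_counters(IdleStatsTable& stats, double now, double window) {
    for (auto& kv : stats) {
        if (now - kv.second.last_packet_at > 2.0 * window) {
            kv.second.packets_forwarded = 0;
            kv.second.proofs_received = 0;
        }
    }
}

// Called when a path record is dropped. For a hops==0 path the destination
// hash is also the neighbor hash; otherwise no entry matches and the erase
// does nothing.
void on_path_removed(IdleStatsTable& stats, const std::string& destination_hash) {
    stats.erase(destination_hash);
}
```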

Surface every substantive neighbor-probing event in the log stream so operators can follow what the feature is doing without resorting to a debugger.

Level discipline:

- NOTICE (actionable failure): probe timed out, paths newly demoted to UNRESPONSIVE. This is what shows up by default and signals trouble.
- INFO (substantive lifecycle): neighbor classified as suspicious (with current counters and idle age), probe being sent (with payload size and timeout), probe-delivery success summary, path-request fallback firing.
- VERBOSE (per-path state transitions): paths individually promoted UNRESPONSIVE -> RESPONSIVE on probe success, or demoted * -> UNRESPONSIVE on probe failure. Stale-counter resets on long-idle neighbors.
- DEBUG (gate-skip diagnostics where the cause is interesting): identity not yet known, no path to peer's probe destination, first-time tracking of a new neighbor, stats erased due to path removal.
- TRACE: every counter increment and every skip reason during the per-tick scan (idle / insufficient activity / recent proof / probe pending / rate-limited).

The probe-delivered and probe-timed-out summaries now distinguish "actual state transition" from "already-in-that-state no-op", so the count in the summary reflects what really changed rather than how many path-table entries the scan touched. Path-state checks read _path_states before calling mark_path_*; only a true UNRESPONSIVE->RESPONSIVE or non-UNRESPONSIVE->UNRESPONSIVE flip emits a per-path VERBOSE line. The aggregate NOTICE/INFO at the end reports both the transition count and the total matched.

Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.
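
The "transitioned versus matched" distinction amounts to reading the state before writing it. A minimal sketch of the timeout-side counting, with hypothetical names (`PathRecord`, `DemoteSummary`, `demote_paths_via`) and logging omitted:

```cpp
#include <map>
#include <string>

enum class LinkState { UNKNOWN, RESPONSIVE, UNRESPONSIVE };

struct PathRecord {
    std::string next_hop;
    LinkState state;
};

struct DemoteSummary {
    int matched;       // paths whose next hop is the neighbor
    int transitioned;  // paths that actually changed state
};

// Demotes every path through `neighbor`, counting real state flips
// separately from entries that were already UNRESPONSIVE.
DemoteSummary demote_paths_via(std::map<std::string, PathRecord>& paths,
                               const std::string& neighbor) {
    DemoteSummary out{0, 0};
    for (auto& kv : paths) {
        PathRecord& p = kv.second;
        if (p.next_hop != neighbor) continue;
        ++out.matched;
        if (p.state != LinkState::UNRESPONSIVE) {
            p.state = LinkState::UNRESPONSIVE;
            ++out.transitioned;  // a per-path VERBOSE line would go here
        }
    }
    return out;
}
```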

- Removed new packet std::function callbacks from RNS_NEIGHBOR_PROBING gating
- Added RNS_NEIGHBOR_PATH_REQUEST gating to replace runtime gating

nilu96 · 2026-06-25T12:35:49Z

Hi attermann,

this is great! Passive direct neighbor tracking is exactly what I’ve been thinking about over the past few weeks. Here are some of my notes and ideas—hopefully, they help with further development.

The Goal

Identify stable, bidirectional connections so we can prioritize routes where we are confident the next hop will receive the packet.

Discarded Idea: Tracking Repeated Announces

Initially, I considered tracking repeats of announces that a node had recently broadcast. However, announce behavior in a complex mesh is too chaotic for this to be reliable. Announces can also arrive via other nodes, and the rules around retry counters (decreasing on same hop count, dropping on hops+1) make the behavior too unpredictable.

Proposed Tracking Mechanisms

Instead, here are three passive methods that might yield better results:

- 1-Hop Announces: Listen only to announces for our own destinations where hops == 1. Since there are no alternative paths for a 1-hop announce, this should be a reliable way to track the responsiveness of direct neighbors. However, this only works for setting neighbors to the RESPONSIVE state: a nearby node that heard an announce might not repeat it if another node has already repeated it.
- Passive Packet Monitoring: Listen for transport nodes to repeat regular packets. If a packet is sent to a next hop that is a transport node (not the final destination) and that node does not repeat it, mark the neighbor as UNRESPONSIVE.
- Link Establishment Proofs (LRPROOF / PROOF):
  - If received: Mark neighbor as RESPONSIVE.
  - If missing: Mark as UNRESPONSIVE. (Note: The missing check only works reliably for LRPROOF when we are the last node before the final destination. Regular PROOF isn't mandatory, and if we are multiple hops away from the final destination, the LRPROOF could simply have been lost on an earlier hop.)

State Management

A strict binary state (RESPONSIVE / UNRESPONSIVE) might be too brittle. It could be better to:

- Allow thresholds: Tolerate a certain amount of packet loss before flipping a neighbor to unresponsive.
- Allow healing: Add a timeout mechanism so the UNRESPONSIVE flag can "heal" and revert after a certain period of time.
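
To make the two suggestions concrete, here is one way a threshold plus healing timeout could be combined. This is only a sketch of the idea, not something present in the PR: `NeighborHealth`, the consecutive-miss counting, and the parameter values are all assumptions.

```cpp
// Hypothetical per-neighbor health tracker with a loss threshold and a
// healing timeout.
class NeighborHealth {
public:
    NeighborHealth(int miss_threshold, double heal_after)
        : _miss_threshold(miss_threshold), _heal_after(heal_after) {}

    // Any confirmed delivery clears the miss count and the flag.
    void on_success() {
        _misses = 0;
        _flagged = false;
    }

    // A miss only flags the neighbor once the threshold is reached; each
    // further miss refreshes the time the flag was set.
    void on_miss(double now) {
        ++_misses;
        if (_misses >= _miss_threshold) {
            _flagged = true;
            _flagged_at = now;
        }
    }

    // The flag "heals" by itself once heal_after seconds have passed
    // without a further miss.
    bool is_unresponsive(double now) const {
        return _flagged && (now - _flagged_at) < _heal_after;
    }

private:
    int _miss_threshold;
    double _heal_after;
    int _misses = 0;
    bool _flagged = false;
    double _flagged_at = 0.0;
};
```

In this variant the miss count is not cleared by healing, so a single miss after a healed period flags the neighbor again immediately; whether that is desirable is a design choice left open here.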

Routing Impacts

If a neighbor is flagged as UNRESPONSIVE, here is how I imagine it should impact routing rules:

Handling Announces from `UNRESPONSIVE` Neighbors:

- Do not repeat announces from this node (since we know we can't reliably reach it).
- Path Creation: If the destination is currently unknown, create a new path entry.
- Path Updates:
  - DON'T overwrite an existing path that goes through a RESPONSIVE (or not yet classified) neighbor, even if the UNRESPONSIVE route has fewer hops. (Maybe use an internal hop penalty?)
  - DO update the path regularly (depending on hop count) if the existing route also relies on an UNRESPONSIVE neighbor.

These rules would allow an announce that took a slightly longer but stable path to be preferred over a path via an unresponsive neighbor.
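
The path-update rules can be written as a single decision function. The sketch below is one reading of the proposal, with hypothetical names (`Reach`, `RouteInfo`, `should_replace_path`); the plain hop-count comparison stands in for the real announce-replacement rules, which take more into account than hop count.

```cpp
enum class Reach { UNCLASSIFIED, RESPONSIVE, UNRESPONSIVE };

struct RouteInfo {
    int hops;
    Reach via;  // classification of the neighbor the route goes through
};

// Decides whether an incoming announce should replace the stored path.
// `existing` is null when the destination is currently unknown.
bool should_replace_path(const RouteInfo* existing, const RouteInfo& incoming) {
    if (existing == nullptr) return true;  // path creation

    const bool incoming_bad = incoming.via == Reach::UNRESPONSIVE;
    const bool existing_bad = existing->via == Reach::UNRESPONSIVE;

    // Never displace a route through a responsive or unclassified neighbor
    // with one through an unresponsive neighbor, even if it is shorter.
    if (incoming_bad && !existing_bad) return false;

    // A stable route is preferred over one via an unresponsive neighbor,
    // even if it is longer.
    if (!incoming_bad && existing_bad) return true;

    // Same class on both sides: fall back to hop count (simplified).
    return incoming.hops <= existing->hops;
}
```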

Handling Path Requests for `UNRESPONSIVE` Neighbors:

Repeat the path request, but do not answer it directly.

attermann added 10 commits June 24, 2026 12:17

Corrections to some gating decisions

30fb787

Added metrics to track path state changes and neighbor probes

fb1bf18

attermann merged commit b8e529f into master Jun 24, 2026
8 checks passed

attermann deleted the neighbor_probe branch June 24, 2026 21:03
