Add passive neighbor-liveness inference with probe confirmation#64
Conversation
…eout dispatch Introduce RNS_NEIGHBOR_PROBING feature flag (default on) for upcoming passive neighbor-liveness inference work. Under this flag, PacketReceipt gains std::function-based delivery/timeout handler setters that accept capture-bearing callables (e.g. lambdas closing over local state); the existing plain function-pointer setters remain for source compatibility with out-of-tree firmware and are marked deprecated. PacketReceipt::check_timeout() previously set status to FAILED but left callback dispatch as a //z thread stub from the Python reference, so the registered timeout callback was never invoked. Wire up synchronous dispatch so the timeout-callback contract actually fires. The dispatcher prefers the std::function handler when set, otherwise falls back to the legacy function-pointer callback. Add test_receipt_timeout_handler_capture covering both the capture-bearing handler path and the timeout dispatch fix. All 171 existing tests continue to pass on native17.
Add an _next_hop field to ReverseEntry so that when a proof comes back along a forwarded route, the inbound proof-consumption code can identify which direct neighbor forwarded the original packet. This is a building block for passive neighbor-liveness inference: counting proofs returned per neighbor lets us spot asymmetric RF connectivity where a neighbor keeps transmitting but silently drops everything inbound. The Python reference plan extends its reverse_table list with an IDX_RT_NEXT_HOP slot; the C++ port uses named members on the existing ReverseEntry class instead. Construction site in Transport::inbound is updated to pass next_hop, which is already in scope from the forwarding calculation a few lines above. Gated on RNS_NEIGHBOR_PROBING; the no-feature build keeps the original three-argument ReverseEntry constructor. All 171 native17 tests pass.
Wire up the data structures and configuration surface for passive neighbor-liveness inference: - Five NEIGHBOR_* tunables alongside the existing Transport timing constants in Type.h (suspicion window, min-packet threshold, probe rate-limit, probe timeout, probe payload size). - NeighborStat struct + NeighborStatsTable using-alias in Transport.h, with a static _neighbor_stats member for the per-neighbor counters. In-memory only (ephemeral state; not microStore-backed). Uses ContainerAllocator so long-lived entries can live in the configured container memory pool. - Two static-bool Reticulum toggles with accessor pairs: neighbor_probing_enabled (default true) and neighbor_probing_path_request_fallback_enabled (default false). The Python reference plan parses these from the reticulum INI block; microReticulum has no INI parser, so they are exposed as static accessors only. All additions guarded by RNS_NEIGHBOR_PROBING with DIVERGENCE comments. No new behavior yet; hooks land in subsequent commits. All 171 native17 tests pass.
Add _record_neighbor_packet helper and call it from the three transmit sites in Transport::outbound: the multi-hop forwarding branch, the hops==1 shared-instance forwarding branch, and the direct-delivery branch. The helper increments packets_forwarded and stamps last_packet_at on the neighbor's NeighborStat entry, creating it on demand. The next-hop attribution uses destination_entry._received_from in all three branches. For multi-hop paths this is the transport node we are handing the packet off to; for hops==0 (directly reachable) it is the destination itself, which by definition is the immediate neighbor. Only counts when transmit returned true so failed sends do not bias the stats. Empty next-hop (broadcast, unknown reverse entry) is silently ignored. Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.
Add _record_neighbor_proof helper and call it from the proof-forwarding block in inbound packet handling. When a returning proof matches a reverse_table entry and gets transported back along the original route, attribute the proof to the neighbor recorded in reverse_entry._next_hop. Together with the outbound packets_forwarded counter, this gives each direct neighbor a ratio of proofs returned vs packets forwarded over the current window. A sustained low ratio is the signal that drives later suspicion + targeted probe dispatch. Only increments the counter when an entry already exists in _neighbor_stats; proofs for neighbors we never counted outbound through are ignored. Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.
Wire the passive neighbor-liveness logic into jobs(). Each tick we walk _neighbor_stats and gate each entry through five filters: sufficient activity, recent forwarding within the suspicion window, no recent proof return, no probe already in flight, and per-neighbor probe rate limit. Surviving entries get a single probe dispatched to the neighbor's built-in probe destination. The probe path: - Identity::recall lifts the neighbor's identity out of the announce store. If the identity is not yet known, log and skip. - Destination::hash derives the probe-destination hash. If we have no path to it, log and skip; if the path-request fallback is enabled, optionally issue a path request and let a later tick retry. - Otherwise construct an OUT/SINGLE destination, send a 16-byte random payload as a Packet, attach delivery and timeout handlers that capture the neighbor hash by value, and stamp probe_pending / last_probe_at on the stats entry. Delivery handler clears probe_pending, resets the window counters, and walks the path table calling mark_path_responsive on every entry whose next-hop is this neighbor. Timeout handler clears probe_pending and calls mark_path_unresponsive on every such entry, letting existing announce-replacement logic swap in a working route on the next fresh announce. Counter reset on timeout is deliberately not done so the next cycle starts from the suspect state. Scan builds a snapshot list before dispatching to avoid iterator invalidation if a dispatch triggers any synchronous transport activity that touches the stats map. Gated on RNS_NEIGHBOR_PROBING and on the runtime triple (transport_enabled, neighbor_probing_enabled, probe_destination_enabled). All 171 native17 tests pass.
Round out the neighbor-stats lifecycle: - _scan_neighbor_stats now resets packets_forwarded and proofs_received to zero when a neighbor has been idle past twice the suspicion window. Without this, a brief burst of forwarding followed by long quiet would leave stale counters that could spuriously trigger suspicion the next time traffic resumed. - remove_path and the legacy _path_table cull both erase the matching _neighbor_stats entry when a destination's path record is dropped. For hops==0 paths the destination hash equals the neighbor hash so the stats entry is the one keyed by the same value; for hops greater than zero the erase is a no-op. Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.
Surface every substantive neighbor-probing event in the log stream so operators can follow what the feature is doing without resorting to a debugger. Level discipline: - NOTICE — actionable failure: probe timed out, paths newly demoted to UNRESPONSIVE. This is what shows up by default and signals trouble. - INFO — substantive lifecycle: neighbor classified as suspicious (with current counters and idle age), probe being sent (with payload size and timeout), probe-delivery success summary, path-request fallback firing. - VERBOSE — per-path state transitions: paths individually promoted UNRESPONSIVE -> RESPONSIVE on probe success, or demoted * -> UNRESPONSIVE on probe failure. Stale-counter resets on long-idle neighbors. - DEBUG — gate-skip diagnostics where the cause is interesting (identity not yet known, no path to peer's probe destination, first-time tracking of a new neighbor, stats erased due to path removal). - TRACE — every counter increment and every skip reason during the per-tick scan (idle / insufficient activity / recent proof / probe pending / rate-limited). The probe-delivered and probe-timed-out summaries now distinguish "actual state transition" from "already-in-that-state no-op", so the count in the summary reflects what really changed rather than how many path-table entries the scan touched. Path-state checks read _path_states before calling mark_path_*; only a true UNRESPONSIVE->RESPONSIVE or non-UNRESPONSIVE->UNRESPONSIVE flip emits a per-path VERBOSE line. The aggregate NOTICE/INFO at the end reports both the transition count and the total matched. Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.
- Removed new packet std::function callbacks from RNS_NEIGHBOR_PROBING gating - Added RNS_NEIGHBOR_PATH_REQUEST gating to replace runtime gating
|
Hi attermann, this is great! Passive direct neighbor tracking is exactly what I’ve been thinking about over the past few weeks. Here are some of my notes and ideas—hopefully, they help with further development. The GoalIdentify stable, bidirectional connections so we can prioritize routes where we are confident the next hop will receive the packet. Discarded Idea: Tracking Repeated AnnouncesInitially, I considered tracking repeated announces that a node recently broadcasted. However, announce behavior in a complex mesh is too chaotic for this to be reliable. Announces can be received from other nodes, and the rules around retry counters (decreasing on same hop count, dropping on Proposed Tracking MechanismsInstead, here are three passive methods that might yield better results:
State ManagementA strict binary state (
Routing ImpactsIf a neighbor is flagged as Handling Announces from
|
Problem
A transport node currently has no way to detect when a direct (hops==0)
neighbor is suffering asymmetric RF connectivity. The neighbor can still
transmit, so its announces continue to flow, path-table entries refresh
in the inbound handler, and age-based expiry never fires. But its
receive path silently drops the data we forward. Packets routed through
it disappear without trace.
The failure mode is worst for fire-and-forget PROVE_NONE traffic and
for paths used by quiet destinations that do not emit return traffic,
because nothing in the existing stack notices that forwarded packets
are never being acknowledged. Sustained blackhole forwarding continues
until the announce horizon eventually runs out, which in some modes
takes hours.
Solution
A local-only, passive inference layer on the transport side:
Track per-direct-neighbor counters: packets forwarded through this
next-hop, and proofs that subsequently returned along the matching
reverse-table route.
On each maintenance tick, evaluate suspicion: if a neighbor has had
meaningful forwarding within the window but no proof returned, it is
a candidate. Per-neighbor rate limits keep probe traffic bounded.
For each candidate, send one packet to the neighbor's existing
probe_destination (already PROVE_ALL and advertised via mgmt
destinations when the peer runs with probe_destination_enabled). The
PacketReceipt's delivery callback confirms reciprocal reachability;
the timeout callback marks every path going through this neighbor as
unresponsive, letting the existing announce-replacement logic swap in
a working route on the next fresh announce.
No wire-format changes. No new packet types. No HELLO protocol. No
changes to receiver-side proof_strategy or PROVE_* semantics. The
whole mechanism rides on existing primitives: probe_destination,
reverse_table, mark_path_responsive, mark_path_unresponsive, and
PacketReceipt callbacks.
Changes
PacketReceipt acquires a std::function-based callback variant so the
neighbor-probe outcome handlers can capture neighbor_hash by value. The
plain function-pointer setters remain for source compatibility with
existing firmware and are marked deprecated. As an incidental fix,
PacketReceipt::check_timeout previously set status to FAILED but left
the dispatch site as a commented-out thread stub from the Python
reference; the timeout callback never actually fired. It now invokes
synchronously, which the new feature depends on and which any prior
caller relying on timeout callbacks also benefits from.
ReverseEntry gains a _next_hop field so a returning proof can be
attributed to the neighbor that forwarded the original packet. The
construction site in Transport::inbound populates it from the next-hop
already in scope.
Transport gains a NeighborStat struct (packets forwarded, proofs
received, timestamps, probe-pending state, pending probe hash) and an
in-memory NeighborStatsTable keyed by neighbor hash. Counters are
window-relative and reset on successful probe completion or after
extended idle. Five tunables (suspicion window, min-packet threshold,
probe rate-limit, probe timeout, probe payload size) sit alongside the
existing Transport timing constants.
Hooks: outbound() increments packets_forwarded after a successful
transmit in all three forwarding branches; the proof-consumption block
increments proofs_received when a returning proof is transported back
through the reverse_table. jobs() runs a scan each tick that walks
neighbor_stats through five gates (sufficient activity, recent
forwarding, no recent proof, no probe in flight, rate limit ok),
snapshots candidates, then dispatches probes. Outcome handlers reset
counters on delivery and demote paths on timeout.
Reticulum gains two programmatic toggles with accessor pairs:
neighbor_probing_enabled (default on) gates the whole feature, and
neighbor_probing_path_request_fallback_enabled (default off) optionally
issues a path request when a suspect neighbor's probe destination is
not yet in the path table.
Lifecycle: remove_path and the legacy _path_table cull both erase the
matching neighbor_stats entry when a path is dropped, and the scan
itself resets accumulated counters for neighbors idle past twice the
suspicion window.
All new code is guarded by RNS_NEIGHBOR_PROBING (default on; set
-DRNS_NEIGHBOR_PROBING=0 in build_flags to compile out) and prefixed
with DIVERGENCE comments noting how each piece relates to the Python
reference. The check_timeout dispatch fix is unguarded because it is a
latent-bug fix, not a divergence.