
Add passive neighbor-liveness inference with probe confirmation#64

Merged
attermann merged 10 commits into master from neighbor_probe on Jun 24, 2026
@attermann

Owner

Problem

A transport node currently has no way to detect when a direct (hops==0)
neighbor is suffering asymmetric RF connectivity. The neighbor can still
transmit, so its announces continue to flow, path-table entries refresh
in the inbound handler, and age-based expiry never fires. But its
receive path silently drops the data we forward. Packets routed through
it disappear without trace.

The failure mode is worst for fire-and-forget PROVE_NONE traffic and
for paths used by quiet destinations that do not emit return traffic,
because nothing in the existing stack notices that forwarded packets
are never being acknowledged. Sustained blackhole forwarding continues
until the announce horizon eventually runs out, which in some modes
takes hours.

Solution

A local-only, passive inference layer on the transport side:

  • Track per-direct-neighbor counters: packets forwarded through this
    next-hop, and proofs that subsequently returned along the matching
    reverse-table route.

  • On each maintenance tick, evaluate suspicion: if a neighbor has had
    meaningful forwarding within the window but no proof returned, it is
    a candidate. Per-neighbor rate limits keep probe traffic bounded.

  • For each candidate, send one packet to the neighbor's existing
    probe_destination (already PROVE_ALL and advertised via mgmt
    destinations when the peer runs with probe_destination_enabled). The
    PacketReceipt's delivery callback confirms reciprocal reachability;
    the timeout callback marks every path going through this neighbor as
    unresponsive, letting the existing announce-replacement logic swap in
    a working route on the next fresh announce.

  • No wire-format changes. No new packet types. No HELLO protocol. No
    changes to receiver-side proof_strategy or PROVE_* semantics. The
    whole mechanism rides on existing primitives: probe_destination,
    reverse_table, mark_path_responsive, mark_path_unresponsive, and
    PacketReceipt callbacks.

Changes

PacketReceipt acquires a std::function-based callback variant so the
neighbor-probe outcome handlers can capture neighbor_hash by value. The
plain function-pointer setters remain for source compatibility with
existing firmware and are marked deprecated. As an incidental fix,
PacketReceipt::check_timeout previously set status to FAILED but left
the dispatch site as a commented-out thread stub from the Python
reference; the timeout callback never actually fired. It now invokes
synchronously, which the new feature depends on and which any prior
caller relying on timeout callbacks also benefits from.
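
A minimal sketch of the dispatch order described above, assuming a simplified receipt type. MiniReceipt and its member names are hypothetical stand-ins for illustration, not the real PacketReceipt interface, and time is passed in explicitly to keep the example deterministic.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical stand-in for a packet receipt; not the real PacketReceipt API.
class MiniReceipt {
public:
    enum Status { SENT, DELIVERED, FAILED };
    using LegacyCallback = void (*)(const MiniReceipt&);      // plain function pointer
    using Handler = std::function<void(const MiniReceipt&)>;  // may carry captures

    MiniReceipt(double sent_at, double timeout)
        : _sent_at(sent_at), _timeout(timeout) {}

    void set_timeout_callback(LegacyCallback cb) { _legacy_timeout = cb; }
    void set_timeout_handler(Handler handler) { _timeout_handler = std::move(handler); }
    Status status() const { return _status; }

    // Once the deadline has passed, mark FAILED and dispatch synchronously.
    // The std::function handler is preferred; the function pointer is the fallback.
    void check_timeout(double now) {
        if (_status != SENT || now < _sent_at + _timeout) return;
        _status = FAILED;
        if (_timeout_handler) {
            _timeout_handler(*this);
        } else if (_legacy_timeout != nullptr) {
            _legacy_timeout(*this);
        }
    }

private:
    double _sent_at;
    double _timeout;
    Status _status = SENT;
    LegacyCallback _legacy_timeout = nullptr;
    Handler _timeout_handler;
};
```

A caller can register a lambda that captures a neighbor hash by value, which a plain function pointer cannot express. Because the status check comes first, the handler fires at most once per receipt.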

ReverseEntry gains a _next_hop field so a returning proof can be
attributed to the neighbor that forwarded the original packet. The
construction site in Transport::inbound populates it from the next-hop
already in scope.

Transport gains a NeighborStat struct (packets forwarded, proofs
received, timestamps, probe-pending state, pending probe hash) and an
in-memory NeighborStatsTable keyed by neighbor hash. Counters are
window-relative and reset on successful probe completion or after
extended idle. Five tunables (suspicion window, min-packet threshold,
probe rate-limit, probe timeout, probe payload size) sit alongside the
existing Transport timing constants.
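
For orientation, the per-neighbor record could look roughly like the sketch below. The constant names and every value except the 16-byte payload size are assumptions made for illustration, last_proof_at is an assumed field, and std::string stands in for the hash type and std::map for the allocator-aware container.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical tunables: names and values are illustrative, not the real constants.
const double        NEIGHBOR_SUSPICION_WINDOW = 300.0;  // seconds
const std::uint32_t NEIGHBOR_MIN_PACKETS      = 5;      // activity threshold
const double        NEIGHBOR_PROBE_INTERVAL   = 600.0;  // per-neighbor rate limit, seconds
const double        NEIGHBOR_PROBE_TIMEOUT    = 30.0;   // seconds
const std::size_t   NEIGHBOR_PROBE_SIZE       = 16;     // payload bytes

// Window-relative counters plus probe bookkeeping for one direct neighbor.
struct NeighborStat {
    std::uint32_t packets_forwarded = 0;
    std::uint32_t proofs_received   = 0;
    double last_packet_at = 0.0;
    double last_proof_at  = 0.0;   // assumed field
    double last_probe_at  = 0.0;
    bool   probe_pending  = false;
    std::string pending_probe_hash;

    // Start a fresh window, e.g. after a confirmed probe or a long idle period.
    void reset_window() {
        packets_forwarded = 0;
        proofs_received   = 0;
    }
};

// In-memory only, keyed by neighbor hash.
using NeighborStatsTable = std::map<std::string, NeighborStat>;
```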

Hooks: outbound() increments packets_forwarded after a successful
transmit in all three forwarding branches; the proof-consumption block
increments proofs_received when a returning proof is transported back
through the reverse_table. jobs() runs a scan each tick that walks
neighbor_stats through five gates (sufficient activity, recent
forwarding, no recent proof, no probe in flight, rate limit ok),
snapshots candidates, then dispatches probes. Outcome handlers reset
counters on delivery and demote paths on timeout.
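
The five gates can be read as one ordered predicate. The sketch below is an approximation under stated assumptions: the thresholds, the Gate enum, the use of a negative last_probe_at to mean "never probed", and the exact comparison operators are all hypothetical.

```cpp
#include <cassert>

// Hypothetical thresholds; the real tunables may differ in name and value.
const unsigned MIN_PACKETS      = 5;
const double   SUSPICION_WINDOW = 300.0;  // seconds
const double   PROBE_INTERVAL   = 600.0;  // seconds

struct NeighborStat {
    unsigned packets_forwarded = 0;
    unsigned proofs_received   = 0;
    double   last_packet_at    = 0.0;
    double   last_probe_at     = -1.0;   // negative: never probed (assumption)
    bool     probe_pending     = false;
};

// Result of the scan for one neighbor: either Suspect or the gate that stopped it.
enum class Gate { Suspect, InsufficientActivity, Idle, RecentProof, ProbePending, RateLimited };

Gate evaluate(const NeighborStat& s, double now) {
    if (s.packets_forwarded < MIN_PACKETS)         return Gate::InsufficientActivity;
    if (now - s.last_packet_at > SUSPICION_WINDOW) return Gate::Idle;
    if (s.proofs_received > 0)                     return Gate::RecentProof;
    if (s.probe_pending)                           return Gate::ProbePending;
    if (s.last_probe_at >= 0.0 && now - s.last_probe_at < PROBE_INTERVAL)
        return Gate::RateLimited;
    return Gate::Suspect;
}
```

Returning the failed gate rather than a bare bool keeps the skip reason available for logging.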

Reticulum gains two programmatic toggles with accessor pairs:
neighbor_probing_enabled (default on) gates the whole feature, and
neighbor_probing_path_request_fallback_enabled (default off) optionally
issues a path request when a suspect neighbor's probe destination is
not yet in the path table.

Lifecycle: remove_path and the legacy _path_table cull both erase the
matching neighbor_stats entry when a path is dropped, and the scan
itself resets accumulated counters for neighbors idle past twice the
suspicion window.

All new code is guarded by RNS_NEIGHBOR_PROBING (default on; set
-DRNS_NEIGHBOR_PROBING=0 in build_flags to compile out) and prefixed
with DIVERGENCE comments noting how each piece relates to the Python
reference. The check_timeout dispatch fix is unguarded because it is a
latent-bug fix, not a divergence.
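
As a rough illustration of how the compile-time flag and the runtime toggle might layer, consider the sketch below. TransportSketch and scan_neighbors are hypothetical names, and the DIVERGENCE comment text is a placeholder rather than a quote from the code.

```cpp
#include <cassert>

// Default on; building with -DRNS_NEIGHBOR_PROBING=0 compiles the feature out.
#ifndef RNS_NEIGHBOR_PROBING
#define RNS_NEIGHBOR_PROBING 1
#endif

// Hypothetical miniature of a transport with a periodic maintenance tick.
struct TransportSketch {
    bool neighbor_probing_enabled = true;  // runtime toggle, default on
    int ticks = 0;
    int scans = 0;

    void jobs() {
        ++ticks;
#if RNS_NEIGHBOR_PROBING
        // DIVERGENCE: placeholder marker; the real comments describe how the
        // code relates to the Python reference.
        if (neighbor_probing_enabled) scan_neighbors();
#endif
    }

#if RNS_NEIGHBOR_PROBING
    void scan_neighbors() { ++scans; }
#endif
};
```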

attermann added 10 commits June 24, 2026 12:17
…eout dispatch

Introduce RNS_NEIGHBOR_PROBING feature flag (default on) for upcoming
passive neighbor-liveness inference work. Under this flag, PacketReceipt
gains std::function-based delivery/timeout handler setters that accept
capture-bearing callables (e.g. lambdas closing over local state); the
existing plain function-pointer setters remain for source compatibility
with out-of-tree firmware and are marked deprecated.

PacketReceipt::check_timeout() previously set status to FAILED but left
callback dispatch as a //z thread stub from the Python reference, so the
registered timeout callback was never invoked. Wire up synchronous
dispatch so the timeout-callback contract actually fires. The dispatcher
prefers the std::function handler when set, otherwise falls back to the
legacy function-pointer callback.

Add test_receipt_timeout_handler_capture covering both the capture-bearing
handler path and the timeout dispatch fix. All 171 existing tests
continue to pass on native17.

Add an _next_hop field to ReverseEntry so that when a proof comes back
along a forwarded route, the inbound proof-consumption code can identify
which direct neighbor forwarded the original packet. This is a building
block for passive neighbor-liveness inference: counting proofs returned
per neighbor lets us spot asymmetric RF connectivity where a neighbor
keeps transmitting but silently drops everything inbound.

The Python reference plan extends its reverse_table list with an
IDX_RT_NEXT_HOP slot; the C++ port uses named members on the existing
ReverseEntry class instead. Construction site in Transport::inbound is
updated to pass next_hop, which is already in scope from the forwarding
calculation a few lines above.
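
A simplified picture of the attribution, assuming string stand-ins for interfaces and hashes. The field names other than next_hop, the packet-hash keying, and the consume-on-lookup behavior are assumptions made for the example.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, simplified reverse-table entry; the real class holds interface
// references and byte-string hashes rather than std::string.
struct ReverseEntry {
    std::string received_from;  // interface the original packet arrived on
    std::string outbound;       // interface it was forwarded out of
    double      timestamp;
    std::string next_hop;       // added: neighbor the packet was handed to
};

using ReverseTable = std::map<std::string, ReverseEntry>;  // keyed by packet hash (assumed)

// Return the neighbor to credit for a returning proof, or "" when nothing matches.
// The entry is consumed on lookup, which is assumed here.
std::string attribute_proof(ReverseTable& table, const std::string& packet_hash) {
    auto it = table.find(packet_hash);
    if (it == table.end()) return std::string();
    std::string next_hop = it->second.next_hop;
    table.erase(it);
    return next_hop;
}
```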

Gated on RNS_NEIGHBOR_PROBING; the no-feature build keeps the original
three-argument ReverseEntry constructor. All 171 native17 tests pass.

Wire up the data structures and configuration surface for passive
neighbor-liveness inference:

- Five NEIGHBOR_* tunables alongside the existing Transport timing
  constants in Type.h (suspicion window, min-packet threshold, probe
  rate-limit, probe timeout, probe payload size).
- NeighborStat struct + NeighborStatsTable using-alias in Transport.h,
  with a static _neighbor_stats member for the per-neighbor counters.
  In-memory only (ephemeral state; not microStore-backed). Uses
  ContainerAllocator so long-lived entries can live in the configured
  container memory pool.
- Two static-bool Reticulum toggles with accessor pairs:
  neighbor_probing_enabled (default true) and
  neighbor_probing_path_request_fallback_enabled (default false). The
  Python reference plan parses these from the reticulum INI block;
  microReticulum has no INI parser, so they are exposed as static
  accessors only.

All additions guarded by RNS_NEIGHBOR_PROBING with DIVERGENCE comments.
No new behavior yet; hooks land in subsequent commits.

All 171 native17 tests pass.

Add _record_neighbor_packet helper and call it from the three transmit
sites in Transport::outbound: the multi-hop forwarding branch, the
hops==1 shared-instance forwarding branch, and the direct-delivery
branch. The helper increments packets_forwarded and stamps last_packet_at
on the neighbor's NeighborStat entry, creating it on demand.

The next-hop attribution uses destination_entry._received_from in all
three branches. For multi-hop paths this is the transport node we are
handing the packet off to; for hops==0 (directly reachable) it is the
destination itself, which by definition is the immediate neighbor.

Only counts when transmit returned true so failed sends do not bias the
stats. Empty next-hop (broadcast, unknown reverse entry) is silently
ignored.

Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.

Add _record_neighbor_proof helper and call it from the proof-forwarding
block in inbound packet handling. When a returning proof matches a
reverse_table entry and gets transported back along the original route,
attribute the proof to the neighbor recorded in reverse_entry._next_hop.

Together with the outbound packets_forwarded counter, this gives each
direct neighbor a ratio of proofs returned vs packets forwarded over the
current window. A sustained low ratio is the signal that drives later
suspicion + targeted probe dispatch.

Only increments the counter when an entry already exists in
_neighbor_stats; proofs for neighbors we never counted outbound
through are ignored.
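
The asymmetry between the two counters (create on demand for packets, existing entries only for proofs) can be sketched as follows. This is a hedged approximation: the free-function signatures, the last_proof_at field, and the std::string keys are assumptions, not the real helpers.

```cpp
#include <cassert>
#include <map>
#include <string>

struct NeighborStat {
    unsigned packets_forwarded = 0;
    unsigned proofs_received   = 0;
    double   last_packet_at    = 0.0;
    double   last_proof_at     = 0.0;  // assumed field
};
using NeighborStatsTable = std::map<std::string, NeighborStat>;

// Outbound hook: count only successful transmits to a known next hop.
void record_neighbor_packet(NeighborStatsTable& stats, const std::string& next_hop,
                            bool transmitted, double now) {
    if (!transmitted || next_hop.empty()) return;  // failed send or no next hop
    NeighborStat& s = stats[next_hop];             // creates the entry on demand
    ++s.packets_forwarded;
    s.last_packet_at = now;
}

// Inbound hook: credit a proof only to a neighbor that is already tracked.
void record_neighbor_proof(NeighborStatsTable& stats, const std::string& next_hop,
                           double now) {
    auto it = stats.find(next_hop);
    if (it == stats.end()) return;                 // never forwarded through it
    ++it->second.proofs_received;
    it->second.last_proof_at = now;
}
```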

Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.

Wire the passive neighbor-liveness logic into jobs(). Each tick we
walk _neighbor_stats and gate each entry through five filters:
sufficient activity, recent forwarding within the suspicion window,
no recent proof return, no probe already in flight, and per-neighbor
probe rate limit. Surviving entries get a single probe dispatched to
the neighbor's built-in probe destination.

The probe path:
- Identity::recall lifts the neighbor's identity out of the announce
  store. If the identity is not yet known, log and skip.
- Destination::hash derives the probe-destination hash. If we have no
  path to it, log and skip; if the path-request fallback is enabled,
  optionally issue a path request and let a later tick retry.
- Otherwise construct an OUT/SINGLE destination, send a 16-byte random
  payload as a Packet, attach delivery and timeout handlers that
  capture the neighbor hash by value, and stamp probe_pending /
  last_probe_at on the stats entry.

Delivery handler clears probe_pending, resets the window counters, and
walks the path table calling mark_path_responsive on every entry whose
next-hop is this neighbor. Timeout handler clears probe_pending and
calls mark_path_unresponsive on every such entry, letting existing
announce-replacement logic swap in a working route on the next fresh
announce. Counter reset on timeout is deliberately not done so the
next cycle starts from the suspect state.

Scan builds a snapshot list before dispatching to avoid iterator
invalidation if a dispatch triggers any synchronous transport activity
that touches the stats map.
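
The snapshot pattern can be illustrated in isolation. In this sketch the suspect flag stands in for the full gate evaluation, the dispatch callback stands in for probe sending, and skipping entries that vanished between snapshot and dispatch is an added assumption.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

struct NeighborStat { bool suspect = false; };
using NeighborStatsTable = std::map<std::string, NeighborStat>;

// Phase 1 copies the candidate keys; phase 2 dispatches without holding any
// iterator into the map, so the callback may insert or erase entries freely.
std::vector<std::string> scan_and_dispatch(
        NeighborStatsTable& stats,
        const std::function<void(const std::string&)>& dispatch) {
    std::vector<std::string> candidates;
    for (const auto& entry : stats) {
        if (entry.second.suspect) candidates.push_back(entry.first);
    }

    std::vector<std::string> dispatched;
    for (const std::string& hash : candidates) {
        if (stats.find(hash) == stats.end()) continue;  // erased by an earlier dispatch
        dispatch(hash);
        dispatched.push_back(hash);
    }
    return dispatched;
}
```

Calling dispatch directly inside the first loop would risk undefined behavior if the callback erased the entry currently being visited.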

Gated on RNS_NEIGHBOR_PROBING and on the runtime triple
(transport_enabled, neighbor_probing_enabled, probe_destination_enabled).
All 171 native17 tests pass.

Round out the neighbor-stats lifecycle:

- _scan_neighbor_stats now resets packets_forwarded and proofs_received
  to zero when a neighbor has been idle past twice the suspicion window.
  Without this, a brief burst of forwarding followed by long quiet would
  leave stale counters that could spuriously trigger suspicion the next
  time traffic resumed.

- remove_path and the legacy _path_table cull both erase the matching
  _neighbor_stats entry when a destination's path record is dropped.
  For hops==0 paths the destination hash equals the neighbor hash so the
  stats entry is the one keyed by the same value; for hops greater than
  zero the erase is a no-op.
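
Both lifecycle rules are small enough to sketch directly. The window value, the function names, and the boolean return values below are assumptions chosen so the behavior can be checked.

```cpp
#include <cassert>
#include <map>
#include <string>

const double SUSPICION_WINDOW = 300.0;  // seconds; hypothetical value

struct NeighborStat {
    unsigned packets_forwarded = 0;
    unsigned proofs_received   = 0;
    double   last_packet_at    = 0.0;
};
using NeighborStatsTable = std::map<std::string, NeighborStat>;

// Zero stale counters once a neighbor has been idle for more than twice the
// suspicion window. Returns true when something was actually reset.
bool reset_if_idle(NeighborStat& s, double now) {
    if (now - s.last_packet_at <= 2.0 * SUSPICION_WINDOW) return false;
    if (s.packets_forwarded == 0 && s.proofs_received == 0) return false;
    s.packets_forwarded = 0;
    s.proofs_received   = 0;
    return true;
}

// Called when a path record is dropped. For a direct neighbor the destination
// hash is also the stats key; for any other destination nothing matches.
bool on_path_removed(NeighborStatsTable& stats, const std::string& destination_hash) {
    return stats.erase(destination_hash) > 0;
}
```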

Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.

Surface every substantive neighbor-probing event in the log stream so
operators can follow what the feature is doing without resorting to a
debugger. Level discipline:

- NOTICE — actionable failure: probe timed out, paths newly demoted to
  UNRESPONSIVE. This is what shows up by default and signals trouble.
- INFO — substantive lifecycle: neighbor classified as suspicious (with
  current counters and idle age), probe being sent (with payload size
  and timeout), probe-delivery success summary, path-request fallback
  firing.
- VERBOSE — per-path state transitions: paths individually promoted
  UNRESPONSIVE -> RESPONSIVE on probe success, or demoted from any other
  state to UNRESPONSIVE on probe failure. Stale-counter resets on long-idle
  neighbors.
- DEBUG — gate-skip diagnostics where the cause is interesting (identity
  not yet known, no path to peer's probe destination, first-time
  tracking of a new neighbor, stats erased due to path removal).
- TRACE — every counter increment and every skip reason during the
  per-tick scan (idle / insufficient activity / recent proof / probe
  pending / rate-limited).

The probe-delivered and probe-timed-out summaries now distinguish
"actual state transition" from "already-in-that-state no-op", so the
count in the summary reflects what really changed rather than how many
path-table entries the scan touched.

Path-state checks read _path_states before calling mark_path_*; only a
true UNRESPONSIVE->RESPONSIVE or non-UNRESPONSIVE->UNRESPONSIVE flip
emits a per-path VERBOSE line. The aggregate NOTICE/INFO at the end
reports both the transition count and the total matched.
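
The difference between "matched" and "actually changed" can be sketched as below. The PathEntry layout, the PathState enum, and apply_probe_outcome are hypothetical; the real code reads _path_states and calls mark_path_responsive / mark_path_unresponsive instead of writing the state directly.

```cpp
#include <cassert>
#include <map>
#include <string>

enum class PathState { Unknown, Responsive, Unresponsive };

struct PathEntry {
    std::string next_hop;
    PathState   state;
};
using PathTable = std::map<std::string, PathEntry>;  // keyed by destination hash

struct OutcomeSummary {
    int matched = 0;  // paths whose next hop is the probed neighbor
    int changed = 0;  // paths whose state really flipped
};

// Apply a probe result to every path through `neighbor`, counting only true
// UNRESPONSIVE -> RESPONSIVE or non-UNRESPONSIVE -> UNRESPONSIVE flips.
OutcomeSummary apply_probe_outcome(PathTable& paths, const std::string& neighbor,
                                   bool delivered) {
    OutcomeSummary summary;
    for (auto& kv : paths) {
        PathEntry& path = kv.second;
        if (path.next_hop != neighbor) continue;
        ++summary.matched;
        const bool was_unresponsive = (path.state == PathState::Unresponsive);
        if (delivered) {
            if (was_unresponsive) ++summary.changed;
            path.state = PathState::Responsive;
        } else {
            if (!was_unresponsive) ++summary.changed;
            path.state = PathState::Unresponsive;
        }
    }
    return summary;
}
```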

Gated on RNS_NEIGHBOR_PROBING. All 171 native17 tests pass.

- Removed new packet std::function callbacks from RNS_NEIGHBOR_PROBING
  gating
- Added RNS_NEIGHBOR_PATH_REQUEST gating to replace runtime gating
attermann merged commit b8e529f into master on Jun 24, 2026
8 checks passed
attermann deleted the neighbor_probe branch on June 24, 2026 21:03
@nilu96

nilu96 commented Jun 25, 2026

Contributor

Hi attermann,

This is great! Passive direct neighbor tracking is exactly what I’ve been thinking about over the past few weeks. Here are some of my notes and ideas; hopefully they help with further development.

The Goal

Identify stable, bidirectional connections so we can prioritize routes where we are confident the next hop will receive the packet.

Discarded Idea: Tracking Repeated Announces

Initially, I considered tracking repeats of the announces that a node recently broadcast. However, announce behavior in a complex mesh is too chaotic for this to be reliable: announces can be received from other nodes, and the rules around retry counters (decreasing on same hop count, dropping on hops+1) make it too unpredictable.

Proposed Tracking Mechanisms

Instead, here are three passive methods that might yield better results:

  • 1-Hop Announces: Listen only to announces for our own destinations where hops == 1. Since there are no alternative paths for a 1-hop announce, this should be a reliable way to track responsiveness of direct neighbors. However, this only works to set neighbors to a RESPONSIVE state: a nearby node that heard an announce might not repeat it if another node has already repeated it.
  • Passive Packet Monitoring: Listen for transport nodes to repeat regular packets. If a packet is sent to a next hop that is a transport node (not the final destination) and that node does not repeat it, mark the neighbor as UNRESPONSIVE.
  • Link Establishment Proofs (LRPROOF / PROOF):
    • If received: Mark neighbor as RESPONSIVE.
    • If missing: Mark as UNRESPONSIVE. (Note: this missing-proof check only works reliably for LRPROOF when we are the last node before the final destination. A regular PROOF isn't mandatory, and an LRPROOF could simply be lost on an earlier hop if we are multiple hops away from the final destination.)

State Management

A strict binary state (RESPONSIVE / UNRESPONSIVE) might be too brittle. It could be better to:

  • Allow thresholds: Tolerate a certain amount of packet loss before flipping a neighbor to unresponsive.
  • Allow healing: Add a timeout mechanism so the UNRESPONSIVE flag can "heal" and revert after a certain period of time.
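
One possible shape for the threshold and healing ideas, offered only as a sketch: a neighbor is flagged after a run of consecutive misses and the flag expires after a fixed period. The constants, the NeighborHealth struct, and the choice of consecutive misses (rather than a loss ratio) are all assumptions.

```cpp
#include <cassert>

// Hypothetical tuning values.
const unsigned MISS_THRESHOLD = 3;      // consecutive misses tolerated before flagging
const double   HEAL_AFTER     = 600.0;  // seconds until the flag expires by itself

struct NeighborHealth {
    unsigned consecutive_misses = 0;
    bool     unresponsive       = false;
    double   flagged_at         = 0.0;
};

// Feed one observation: a success clears the flag, a miss counts toward it.
void record_outcome(NeighborHealth& h, bool success, double now) {
    if (success) {
        h.consecutive_misses = 0;
        h.unresponsive = false;
        return;
    }
    ++h.consecutive_misses;
    if (!h.unresponsive && h.consecutive_misses >= MISS_THRESHOLD) {
        h.unresponsive = true;
        h.flagged_at = now;
    }
}

// Query the state, letting an old flag heal once HEAL_AFTER has elapsed.
bool is_unresponsive(NeighborHealth& h, double now) {
    if (h.unresponsive && now - h.flagged_at >= HEAL_AFTER) {
        h.unresponsive = false;
        h.consecutive_misses = 0;
    }
    return h.unresponsive;
}
```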

Routing Impacts

If a neighbor is flagged as UNRESPONSIVE, here is how I imagine it should impact routing rules:

Handling Announces from UNRESPONSIVE Neighbors:

  • Do not repeat announces from this node (since we know we can't reliably reach them).
  • Path Creation: If the destination is currently unknown, create a new path entry.
  • Path Updates:
    • DON'T overwrite an existing path that goes through a RESPONSIVE (or not yet classified) neighbor, even if the UNRESPONSIVE route has fewer hops (maybe use an internal hop penalty?).
    • DO update the path regularly (depending on hop count) if the existing route also relies on an UNRESPONSIVE neighbor.

These rules would allow an announce that took a slightly longer but stable path to be preferred over a path via an unresponsive neighbor.
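
Read as a decision function, the announce rules above might look like the following sketch. PathCandidate and should_replace are hypothetical, and the fallback comparison on hop count is an assumed baseline that ignores the timing and expiry checks a real path table applies.

```cpp
#include <cassert>

// Hypothetical summary of a path: hop count plus whether its next hop is
// currently flagged UNRESPONSIVE.
struct PathCandidate {
    unsigned hops;
    bool     via_unresponsive;
};

// Decide whether an announced path should replace the existing one.
// `existing` is nullptr when the destination is not yet known.
bool should_replace(const PathCandidate* existing, const PathCandidate& announced) {
    if (existing == nullptr) return true;  // unknown destination: create the path
    if (announced.via_unresponsive && !existing->via_unresponsive)
        return false;                      // keep the stable path, even if longer
    if (!announced.via_unresponsive && existing->via_unresponsive)
        return true;                       // a stable path beats a flagged one
    return announced.hops <= existing->hops;  // same class: assumed hop-count rule
}
```

The hop-penalty variant mentioned above would instead add a fixed number of hops to any path through a flagged neighbor and compare the totals, which turns the hard rule into a soft preference.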

Handling Path Requests for UNRESPONSIVE Neighbors:

  • Repeat the path request, but do not answer it directly.
