Bolt partial-walk recovery does a cold re-probe: ledger replay grace and probe-cache grace expire on the same tick #278

Description

@AprilNEA

Summary

When a Bolt receiver returns a partial walk (the pairing-count register answered, but one slot's pairing-register read failed), the per-node NodeLedger replay grace and the per-device probe-cache eviction grace expire on the same tick. A device that recovers right as the grace ends then needs a cold re-probe (a full feature-table walk) instead of recovering with a warm cache, which defeats the "recover with a warm cache" intent of #222.

Low severity: no flap, no data loss. The cost is one extra feature-table walk on recovery, in a narrow failure window. Flagged by Greptile on the #222 thread.

Mechanism

#222 added two independent grace mechanisms, both with a 3-tick window:

  • NodeLedger (node_ledger.rs:32): NodeLedger::settle replays a node's last-good inventory for NODE_MISS_GRACE = 3 consecutive failed probes.
  • Probe cache (evict_unseen, CACHE_MISS_GRACE = 3 at inventory.rs:116): a cached device survives 3 ticks of not being in seen_keys before eviction.
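
As a rough illustration of the cache-side grace, here is a toy model of an evict_unseen-style pass. It is a sketch, not the code in inventory.rs: the ProbeCache struct, the u32 CacheKey alias, and the exact boundary (an entry is dropped once its miss count reaches CACHE_MISS_GRACE) are all assumptions made for the example.

```rust
use std::collections::{HashMap, HashSet};

/// Same value the issue quotes for inventory.rs.
const CACHE_MISS_GRACE: u32 = 3;

/// Hypothetical stand-in for the real cache key type.
type CacheKey = u32;

/// Toy probe cache: each cached device maps to its count of
/// consecutive ticks on which it was absent from `seen_keys`.
#[derive(Default)]
struct ProbeCache {
    misses: HashMap<CacheKey, u32>,
}

impl ProbeCache {
    fn insert(&mut self, key: CacheKey) {
        self.misses.insert(key, 0);
    }

    fn is_warm(&self, key: CacheKey) -> bool {
        self.misses.contains_key(&key)
    }

    /// One eviction pass per tick: seen keys reset their counter,
    /// unseen keys age, and an entry is dropped once its counter
    /// reaches CACHE_MISS_GRACE (assumed boundary).
    fn evict_unseen(&mut self, seen_keys: &HashSet<CacheKey>) {
        self.misses.retain(|key, misses| {
            if seen_keys.contains(key) {
                *misses = 0;
                true
            } else {
                *misses += 1;
                *misses < CACHE_MISS_GRACE
            }
        });
    }
}
```

With two cached devices and only one of them in seen_keys, the other stays warm for two passes and is dropped on the third under this model.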

In Enumerator::enumerate_reporting_health, outcomes.extend(probe.outcomes) runs unconditionally. On a partial Bolt walk:

  • the slots that did read produce a CacheOutcome → advance seen_keys → their cache stays warm;
  • the slot that didn't read (its get_device_pairing_information errored, so probe_bolt_slot returned None) produces no outcome → it's absent from seen_keys → its cache entry ages via evict_unseen.

Meanwhile the ledger replays the whole node's last-good inventory (including the un-read device) for NODE_MISS_GRACE ticks. Because NODE_MISS_GRACE == CACHE_MISS_GRACE == 3, the un-read device's cache entry is evicted on the same tick the replay grace runs out. So a device that recovers after a miss spanning the full grace comes back cold and repeats a full Device::new + enumerate_features walk.
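
The timeline can be sketched with a small simulation of the un-read slot during a sustained partial walk. This is a simplified model under assumptions: the TickState struct and simulate_unread_slot function are hypothetical, the ledger is taken to replay on failed probes 1 through NODE_MISS_GRACE, and the cache entry is taken to be dropped by the eviction pass on the tick its miss count reaches CACHE_MISS_GRACE. The real off-by-one behaviour may differ.

```rust
const NODE_MISS_GRACE: u32 = 3;
const CACHE_MISS_GRACE: u32 = 3;

/// State of the un-read slot at the end of one tick.
#[derive(Debug, PartialEq, Clone, Copy)]
struct TickState {
    tick: u32,
    /// The ledger still reports the device from its last-good inventory.
    ledger_replays: bool,
    /// The probe-cache entry is still present after this tick's eviction pass.
    cache_warm: bool,
}

/// Simulates `ticks` consecutive partial walks in which one slot never reads.
fn simulate_unread_slot(ticks: u32) -> Vec<TickState> {
    let mut node_misses = 0u32;
    let mut cache_misses = 0u32;
    let mut cache_warm = true;
    let mut timeline = Vec::new();

    for tick in 1..=ticks {
        // Ledger side: every partial walk counts as a failed probe.
        node_misses += 1;
        let ledger_replays = node_misses <= NODE_MISS_GRACE;

        // Cache side: the slot produced no outcome, so its entry ages.
        if cache_warm {
            cache_misses += 1;
            if cache_misses >= CACHE_MISS_GRACE {
                cache_warm = false;
            }
        }

        timeline.push(TickState { tick, ledger_replays, cache_warm });
    }
    timeline
}
```

In this model, tick 3 is both the last tick on which the ledger replays the device and the tick on which its cache entry is evicted, so a slot that reads again on tick 4 has no warm entry to recover with.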

(The total-timeout path — NodeProbe::failed() — emits no outcomes at all, so every device ages uniformly and this asymmetry doesn't arise. It is specific to a partial walk where some siblings read and one didn't.)

Relationship to #251

#251 (mirror the Unifying per-slot guard onto the Bolt slot probe) narrows this: after #251, a slot whose deep walk hangs falls back to cached/identity data and stays in paired → it produces a Seen outcome → its cache stays warm. The remaining surface is specifically a failed pairing-register read (get_device_pairing_information), which #251 does not wrap.

This should be re-evaluated only after #251 lands — its surface may shrink to a near-unreachable case, in which case this can be closed.

Proposed fix

Coordinate the two grace mechanisms: when NodeLedger::settle replays a node's last_good, mark that inventory's devices as seen so their probe-cache stays warm exactly as long as the ledger claims the devices still exist. Snag: PairedDevice doesn't carry its CacheKey, so the ledger (or the replay path) must surface the keys to the cache-eviction step.
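
One possible shape for this coordination, as a minimal sketch only. Everything here is an assumption for illustration: the LastGood and Settled structs, the signature of settle, the seen_with_replay helper, and the choice to store keys next to the last-good inventory. The real NodeLedger and PairedDevice types are not shown in this issue.

```rust
use std::collections::HashSet;

const NODE_MISS_GRACE: u32 = 3;

/// Hypothetical stand-in for the real cache key type.
type CacheKey = u32;

/// Simplified stand-in; the real PairedDevice does not carry its CacheKey.
#[derive(Clone, Debug, PartialEq)]
struct PairedDevice {
    name: String,
}

/// Hypothetical: the last-good inventory stored together with its cache keys.
#[derive(Clone, Default)]
struct LastGood {
    devices: Vec<PairedDevice>,
    keys: Vec<CacheKey>,
}

#[derive(Default)]
struct NodeLedger {
    last_good: LastGood,
    misses: u32,
}

/// Result of settling one tick: the inventory to report, plus the keys
/// of any devices that were replayed rather than freshly probed.
struct Settled {
    devices: Vec<PairedDevice>,
    replayed_keys: Vec<CacheKey>,
}

impl NodeLedger {
    /// `fresh` is Some on a healthy probe and None on a failed one.
    fn settle(&mut self, fresh: Option<LastGood>) -> Settled {
        match fresh {
            Some(good) => {
                self.misses = 0;
                self.last_good = good.clone();
                Settled { devices: good.devices, replayed_keys: Vec::new() }
            }
            None => {
                self.misses += 1;
                if self.misses <= NODE_MISS_GRACE {
                    // Replay: surface the keys so the cache can treat them as seen.
                    Settled {
                        devices: self.last_good.devices.clone(),
                        replayed_keys: self.last_good.keys.clone(),
                    }
                } else {
                    Settled { devices: Vec::new(), replayed_keys: Vec::new() }
                }
            }
        }
    }
}

/// Merges keys seen by this tick's probe with keys the ledger replayed.
fn seen_with_replay(probe_seen: &HashSet<CacheKey>, settled: &Settled) -> HashSet<CacheKey> {
    let mut seen = probe_seen.clone();
    seen.extend(settled.replayed_keys.iter().copied());
    seen
}
```

The merged set would then be what the eviction step consumes, so a replayed device's cache entry only starts aging once the ledger has stopped replaying it.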

Do not apply the naive "discard outcomes when !healthy" fix — that cold-ages the healthy sibling slots that read fine this tick, making the problem worse.
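
To make the regression concrete, the sketch below compares how seen_keys would be built under the current unconditional extend and under the naive fix. It is a simplification: the NodeProbe struct here is a hypothetical reduction in which each outcome is just a cache key, and both helper functions are invented for the example.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the real cache key type.
type CacheKey = u32;

/// Reduced model of a probe result: whether the walk was healthy, and
/// the cache keys of the slots that did read.
struct NodeProbe {
    healthy: bool,
    outcomes: Vec<CacheKey>,
}

/// Current behaviour as described above: outcomes are always kept.
fn seen_keys_unconditional(probes: &[NodeProbe]) -> HashSet<CacheKey> {
    probes
        .iter()
        .flat_map(|p| p.outcomes.iter().copied())
        .collect()
}

/// Naive fix: outcomes from an unhealthy probe are discarded, so the
/// sibling slots that read fine are also missing from the seen set.
fn seen_keys_discard_unhealthy(probes: &[NodeProbe]) -> HashSet<CacheKey> {
    probes
        .iter()
        .filter(|p| p.healthy)
        .flat_map(|p| p.outcomes.iter().copied())
        .collect()
}
```

For a partial walk where slots 1 and 2 read and slot 3 did not, the unconditional version yields {1, 2}, while the naive version yields an empty set, which would start aging all three cache entries instead of one.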

Refs #222, #251, #218. Related: #277 (the other #218-cluster one-shot/resilience tail).

Metadata

    Labels

    area: hid (HID device discovery, permissions, reads, or writes)
    status: blocked (Blocked by another issue, dependency, or external constraint)
    type: enhancement (Improvement to existing functionality)
