Bolt partial-walk recovery does a cold re-probe: ledger replay grace and probe-cache grace expire on the same tick

## Summary

When a Bolt receiver returns a **partial** walk (the pairing-count register answered, but one slot's pairing-register read failed), the per-node `NodeLedger` replay grace and the per-device probe-cache eviction grace expire on the *same* tick. A device that recovers just as the grace window ends therefore needs a **cold** re-probe (a full feature-table walk) rather than reusing its cached entry, which defeats the "recover with a warm cache" intent of #222.

Low severity: no flap, no data loss. The cost is one extra feature-table walk on recovery, in a narrow failure window. Flagged by Greptile on the #222 thread.

## Mechanism

#222 added two independent grace mechanisms, both with a 3-tick window:

- `NodeLedger` ([`node_ledger.rs:32`](https://github.com/AprilNEA/OpenLogi/blob/a82398665f9d7d27ce545d59b87df22c4d1c8205/crates/openlogi-hid/src/node_ledger.rs#L32)): replays a node's last-good inventory for `NODE_MISS_GRACE = 3` consecutive failed probes ([`settle`](https://github.com/AprilNEA/OpenLogi/blob/a82398665f9d7d27ce545d59b87df22c4d1c8205/crates/openlogi-hid/src/node_ledger.rs#L79)).
- Probe cache ([`evict_unseen`](https://github.com/AprilNEA/OpenLogi/blob/a82398665f9d7d27ce545d59b87df22c4d1c8205/crates/openlogi-hid/src/inventory.rs#L432), `CACHE_MISS_GRACE = 3` at [`inventory.rs:116`](https://github.com/AprilNEA/OpenLogi/blob/a82398665f9d7d27ce545d59b87df22c4d1c8205/crates/openlogi-hid/src/inventory.rs#L116)): a cached device survives 3 ticks of not being in `seen_keys` before eviction.

In `Enumerator::enumerate_reporting_health`, [`outcomes.extend(probe.outcomes)`](https://github.com/AprilNEA/OpenLogi/blob/a82398665f9d7d27ce545d59b87df22c4d1c8205/crates/openlogi-hid/src/inventory.rs#L397) runs unconditionally. On a partial Bolt walk:

- the slots that **did** read produce a `CacheOutcome` → advance `seen_keys` → their cache stays warm;
- the slot that **didn't** read (its `get_device_pairing_information` errored, so `probe_bolt_slot` returned `None`) produces **no** outcome → it's absent from `seen_keys` → its cache entry ages via `evict_unseen`.
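A minimal model of that aging, for illustration only. `ProbeCache`, the `u8` `CacheKey`, and the counter layout are hypothetical stand-ins, not the real `inventory.rs` types; the sketch also assumes an entry is dropped once its miss count *exceeds* the grace, which may not match the real off-by-one convention.

```rust
use std::collections::{HashMap, HashSet};

// Same value the issue cites for inventory.rs; everything else here is a toy.
const CACHE_MISS_GRACE: u32 = 3;

// Hypothetical key type: one small integer per paired slot.
type CacheKey = u8;

#[derive(Default)]
struct ProbeCache {
    // Consecutive ticks each cached device has been absent from `seen_keys`.
    misses: HashMap<CacheKey, u32>,
}

impl ProbeCache {
    fn insert(&mut self, key: CacheKey) {
        self.misses.insert(key, 0);
    }

    // Reset the counter for seen keys, age the rest, and drop any entry
    // whose miss count exceeds the grace (assumed convention).
    fn evict_unseen(&mut self, seen: &HashSet<CacheKey>) {
        self.misses.retain(|key, misses| {
            if seen.contains(key) {
                *misses = 0;
                true
            } else {
                *misses += 1;
                *misses <= CACHE_MISS_GRACE
            }
        });
    }

    fn is_warm(&self, key: CacheKey) -> bool {
        self.misses.contains_key(&key)
    }
}
```

With slots 1 and 2 cached and only slot 1 present in `seen` on every tick, slot 2 stays warm through three calls to `evict_unseen` and is dropped on the fourth, while slot 1 never ages.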

Meanwhile the ledger replays the *whole node's* last-good inventory (including the un-read device) for `NODE_MISS_GRACE` ticks. Because `NODE_MISS_GRACE == CACHE_MISS_GRACE == 3`, the un-read device's cache entry is evicted on the same tick the replay grace runs out. If the miss lasts the full grace window, the un-read device is cold on recovery and repeats a full `Device::new` + `enumerate_features` walk.

(The total-timeout path — `NodeProbe::failed()` — emits *no* outcomes at all, so every device ages uniformly and this asymmetry doesn't arise. It is specific to a *partial* walk where some siblings read and one didn't.)
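The coincidence can be seen in a toy timeline for the un-read slot. This is a sketch, not the real control flow: `TickState` and `simulate` are hypothetical names, and both counters are assumed to use the same "expired once misses exceed the grace" convention.

```rust
const NODE_MISS_GRACE: u32 = 3;
const CACHE_MISS_GRACE: u32 = 3;

// What holds for the un-read slot at the end of one tick.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct TickState {
    // The ledger still replays the node's last-good inventory.
    ledger_replays: bool,
    // The slot's probe-cache entry has not been evicted yet.
    cache_warm: bool,
}

// Simulate `failed_ticks` consecutive partial walks in which the same slot
// fails its pairing-register read. Both counters advance once per tick.
fn simulate(failed_ticks: u32) -> Vec<TickState> {
    let mut node_misses = 0u32;
    let mut cache_misses = 0u32;
    let mut timeline = Vec::new();
    for _ in 0..failed_ticks {
        node_misses += 1;
        cache_misses += 1;
        timeline.push(TickState {
            ledger_replays: node_misses <= NODE_MISS_GRACE,
            cache_warm: cache_misses <= CACHE_MISS_GRACE,
        });
    }
    timeline
}
```

In `simulate(4)`, the first three entries have both flags set and the fourth has both cleared: there is no tick on which the ledger has stopped replaying but the cache entry is still available for a warm recovery.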

## Relationship to #251

#251 (mirror the Unifying per-slot guard onto the Bolt slot probe) **narrows** this: after #251, a slot whose *deep walk* hangs falls back to cached/identity data and stays in `paired` → it produces a `Seen` outcome → its cache stays warm. The remaining surface is specifically a failed **pairing-register** read (`get_device_pairing_information`), which #251 does not wrap.

**This should be re-evaluated only after #251 lands.** The remaining surface may shrink to a near-unreachable case, in which case this issue can be closed.

## Proposed fix

Coordinate the two grace mechanisms: when `NodeLedger::settle` replays a node's `last_good`, mark that inventory's devices as `seen` so their probe-cache stays warm exactly as long as the ledger claims the devices still exist. Snag: `PairedDevice` doesn't carry its `CacheKey`, so the ledger (or the replay path) must surface the keys to the cache-eviction step.
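One possible shape for that coordination, as a rough sketch. The `NodeLedger` below is a simplified stand-in that stores cache keys directly (the real ledger stores inventories whose `PairedDevice`s lack a `CacheKey`, which is exactly the snag); the `settle` signature and `seen_with_replay` are hypothetical.

```rust
use std::collections::HashSet;

const NODE_MISS_GRACE: u32 = 3;

// Hypothetical key type: one small integer per paired slot.
type CacheKey = u8;

// Toy ledger that remembers the cache keys of the node's last-good inventory.
struct NodeLedger {
    last_good_keys: Vec<CacheKey>,
    misses: u32,
}

impl NodeLedger {
    fn new() -> Self {
        NodeLedger { last_good_keys: Vec::new(), misses: 0 }
    }

    // `Some(keys)` models a healthy probe, `None` a failed or partial one.
    // Returns the keys the ledger vouches for on this tick.
    fn settle(&mut self, fresh: Option<Vec<CacheKey>>) -> Vec<CacheKey> {
        match fresh {
            Some(keys) => {
                self.misses = 0;
                self.last_good_keys = keys.clone();
                keys
            }
            None => {
                self.misses += 1;
                if self.misses <= NODE_MISS_GRACE {
                    // Still inside the grace: replay the last-good keys.
                    self.last_good_keys.clone()
                } else {
                    self.last_good_keys.clear();
                    Vec::new()
                }
            }
        }
    }
}

// Union of keys that actually read this tick and keys the ledger replayed;
// this set would be handed to the cache-eviction step instead of the
// probe-only `seen_keys`.
fn seen_with_replay(probed: &[CacheKey], replayed: &[CacheKey]) -> HashSet<CacheKey> {
    probed.iter().chain(replayed).copied().collect()
}
```

On a partial walk where only slot 1 read, `seen_with_replay(&[1], &ledger.settle(None))` still contains slot 2 while the ledger is replaying, so the un-read slot's cache entry does not age during that window.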

**Do not** apply the naive "discard `outcomes` when `!healthy`" fix — that cold-ages the healthy sibling slots that read fine this tick, making the problem worse.

Refs #222, #251, #218. Related: #277 (the other #218-cluster one-shot/resilience tail).

