Summary
When a Bolt receiver returns a partial walk (the pairing-count register answered, but one slot's pairing-register read failed), the per-node NodeLedger replay grace and the per-device probe-cache eviction grace expire on the same tick. A device that recovers right at the end of the grace window then needs a cold re-probe (a full feature-table walk) instead of recovering with a warm cache, which defeats the "recover with a warm cache" intent of #222.
Low severity: no flap, no data loss. The cost is one extra feature-table walk on recovery, in a narrow failure window. Flagged by Greptile on the #222 thread.
Mechanism
#222 added two independent grace mechanisms, both with a 3-tick window:
- NodeLedger (node_ledger.rs:32): replays a node's last-good inventory for NODE_MISS_GRACE = 3 consecutive failed probes (settle).
- Probe cache (evict_unseen, CACHE_MISS_GRACE = 3 at inventory.rs:116): a cached device survives 3 ticks of not being in seen_keys before eviction.
In Enumerator::enumerate_reporting_health, outcomes.extend(probe.outcomes) runs unconditionally. On a partial Bolt walk:
- the slots that did read produce a CacheOutcome → advance seen_keys → their cache stays warm;
- the slot that didn't read (its get_device_pairing_information errored, so probe_bolt_slot returned None) produces no outcome → it's absent from seen_keys → its cache entry ages via evict_unseen.
Meanwhile the ledger replays the whole node's last-good (including the un-read device) for NODE_MISS_GRACE ticks. Because NODE_MISS_GRACE == CACHE_MISS_GRACE == 3, the replayed device's cache entry expires on the same tick the replay grace runs out. So on recovery after a sustained miss the un-read device is cold and repeats a full Device::new + enumerate_features walk.
(The total-timeout path — NodeProbe::failed() — emits no outcomes at all, so every device ages uniformly and this asymmetry doesn't arise. It is specific to a partial walk where some siblings read and one didn't.)
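The timing above can be reproduced with a toy model. This is a minimal sketch, not the real Enumerator or NodeLedger code: `simulate_partial_walk`, `Expiry`, and the slot numbering are hypothetical, and the exact off-by-one semantics (an entry expires once its miss count exceeds the grace) are an assumption applied identically to both counters.

```rust
use std::collections::{HashMap, HashSet};

// Assumed to mirror the constants named in this issue.
const NODE_MISS_GRACE: u32 = 3;
const CACHE_MISS_GRACE: u32 = 3;

/// Result of simulating a sustained partial walk (hypothetical type).
#[derive(Debug, PartialEq)]
struct Expiry {
    /// First tick on which the ledger no longer replays last-good.
    replay_ends_tick: Option<u32>,
    /// First tick on which the un-read slot's cache entry is gone.
    cache_evicted_tick: Option<u32>,
    /// Whether the sibling slot that kept reading is still cached.
    sibling_still_cached: bool,
}

/// Two paired slots; slot 2's pairing-register read fails on every tick.
fn simulate_partial_walk(cache_grace: u32) -> Expiry {
    // Slot -> consecutive ticks absent from seen_keys.
    let mut cache_misses: HashMap<u8, u32> = HashMap::from([(1, 0), (2, 0)]);
    let mut node_misses = 0u32;
    let mut replay_ends_tick = None;
    let mut cache_evicted_tick = None;

    for tick in 1..=10 {
        // Partial walk: only slot 1 produces an outcome this tick.
        let seen: HashSet<u8> = HashSet::from([1]);

        // Ledger side: the probe counts as failed, so replay while within grace.
        node_misses += 1;
        if node_misses > NODE_MISS_GRACE && replay_ends_tick.is_none() {
            replay_ends_tick = Some(tick);
        }

        // Cache side: seen entries reset, unseen entries age, stale ones go.
        for (slot, misses) in cache_misses.iter_mut() {
            if seen.contains(slot) {
                *misses = 0;
            } else {
                *misses += 1;
            }
        }
        cache_misses.retain(|_, misses| *misses <= cache_grace);
        if !cache_misses.contains_key(&2) && cache_evicted_tick.is_none() {
            cache_evicted_tick = Some(tick);
        }
    }

    Expiry {
        replay_ends_tick,
        cache_evicted_tick,
        sibling_still_cached: cache_misses.contains_key(&1),
    }
}
```

Under these assumptions, `simulate_partial_walk(CACHE_MISS_GRACE)` reports the same tick for both expiries while the sibling slot stays cached. Passing a larger cache grace moves the eviction later than the end of the replay, which is the margin the current constants do not provide.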
Relationship to #251
#251 (mirror the Unifying per-slot guard onto the Bolt slot probe) narrows this: after #251, a slot whose deep walk hangs falls back to cached/identity data and stays in paired → it produces a Seen outcome → its cache stays warm. The remaining surface is specifically a failed pairing-register read (get_device_pairing_information), which #251 does not wrap.
This should be re-evaluated only after #251 lands — its surface may shrink to a near-unreachable case, in which case this can be closed.
Proposed fix
Coordinate the two grace mechanisms: when NodeLedger::settle replays a node's last_good, mark that inventory's devices as seen so their probe-cache stays warm exactly as long as the ledger claims the devices still exist. Snag: PairedDevice doesn't carry its CacheKey, so the ledger (or the replay path) must surface the keys to the cache-eviction step.
Do not apply the naive "discard outcomes when !healthy" fix — that cold-ages the healthy sibling slots that read fine this tick, making the problem worse.
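One possible shape for the coordination, as a sketch only. Everything here is hypothetical: `ToyLedger`, `LedgerEntry`, `settle_failed`, and `seen_with_replay` are illustrative names, `CacheKey` is stood in for by a `String`, and the real `NodeLedger::settle` signature is not assumed. The idea shown is that the ledger stores each device's key next to it and hands the keys back on replay, so the eviction step can treat replayed devices as seen without discarding the outcomes of sibling slots that did read.

```rust
use std::collections::HashSet;

// Hypothetical stand-in; the real CacheKey type is not shown in this issue.
type CacheKey = String;

#[derive(Clone, Debug)]
struct PairedDevice {
    name: String,
}

/// Hypothetical ledger entry that pairs a device with its cache key,
/// working around PairedDevice not carrying the key itself.
struct LedgerEntry {
    device: PairedDevice,
    key: CacheKey,
}

struct ToyLedger {
    last_good: Vec<LedgerEntry>,
    misses: u32,
    grace: u32,
}

impl ToyLedger {
    /// On a failed probe, replay last-good while within grace and
    /// surface the keys of the replayed devices alongside them.
    fn settle_failed(&mut self) -> Option<(Vec<PairedDevice>, Vec<CacheKey>)> {
        self.misses += 1;
        if self.misses > self.grace {
            return None; // Grace exhausted: nothing is replayed or kept warm.
        }
        let devices = self.last_good.iter().map(|e| e.device.clone()).collect();
        let keys = self.last_good.iter().map(|e| e.key.clone()).collect();
        Some((devices, keys))
    }
}

/// Union the keys seen by this tick's probe with the replayed keys.
/// The result is what the eviction step would be given as seen_keys.
fn seen_with_replay(
    probe_seen: &HashSet<CacheKey>,
    replayed: &[CacheKey],
) -> HashSet<CacheKey> {
    let mut seen = probe_seen.clone();
    seen.extend(replayed.iter().cloned());
    seen
}
```

With this arrangement a replayed device stays in the seen set for as long as the ledger replays it, and starts aging in the cache only once the replay grace has run out.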
Refs #222, #251, #218. Related: #277 (the other #218-cluster one-shot/resilience tail).