Skip to content

Phase 2b retry loop: investigate jitter + interruptible sleep between attempts #3808

@iduartgomez

Description

@iduartgomez

Context

Surfaced by the code-first review of PR #3806 (#1454 Phase 2b), deferred with the note "worth confirming intentional."

The Phase 2b SUBSCRIBE driver in `operations/subscribe/op_ctx_task.rs::drive_client_subscribe_inner` goes straight from one failed attempt into the next `send_and_await` call with zero inter-attempt delay. That matches legacy in the sense that legacy also doesn't have explicit jitter between retries — but legacy serializes through the event loop, which provides incidental backoff: each retry has to wait for `notify_op_change` → `priority_select` → `handle_op_result` → `process_message` to cycle. Task-per-tx has no such incidental delay.

Under the right conditions — a contract that no peer is hosting, so every retry returns `NotFound` — a Phase 2b driver can burn through `MAX_BREADTH * MAX_RETRIES = 3 * 10 = 30` attempts essentially as fast as the network can deliver replies. That's a latent DoS vector against the topology-wide CPU/bandwidth budget for a single client subscribe.

Code-style rule

`.claude/rules/code-style.md` WHEN writing retry/backoff loops:

All retry/backoff loops MUST apply random jitter:
→ At least ±20% of the interval to prevent thundering herd

The rule presumes there IS a base interval. Phase 2b's driver has zero interval, so the rule isn't violated, but it isn't followed either — and the reviewer's concern is whether zero is actually the right choice.

Investigation questions

  1. Is the current behavior a regression relative to legacy? Need to measure the effective inter-retry delay in the legacy path under realistic load — is it 5 ms? 50 ms? 500 ms? Task-per-tx's zero-delay behavior is fine only if legacy's effective delay is also in that ballpark.
  2. What's the right interval? If legacy's effective delay is ~50 ms, the direct equivalent would be a `TrackedBackoff` per subscribe with base 50 ms. But we probably don't want full `TrackedBackoff` per-tx — too much state. A simpler `GlobalRng::random_f64_range(base0.8..base1.2)` jittered `deterministic_select!` sleep is probably enough.
  3. Interruptible sleep requirement: per the code-style rule, "Backoff sleeps MUST be interruptible: Use tokio::select! to race sleep against cancellation signal." Phase 2b drivers can be aborted by the task being dropped, which cancels `.await` points — but if there's a sleep between attempts, that sleep must respect cancellation. In practice a plain `sleep(duration).await` inside a spawned task is interruptible (drop aborts it), but the spirit of the rule suggests racing against an explicit cancellation signal too.
  4. DST determinism: if we add sleeps, they must use `deterministic_select!` (like the existing wait path in `subscribe.rs::wait_for_contract_with_timeout`) so the simulation runner advances virtual time.

Proposed (subject to investigation)

If legacy's effective delay is measurable and non-trivial:

```rust
// Between attempts in drive_client_subscribe_inner:
if retries > 0 || attempts_at_hop > 1 {
let base_ms = 20; // TBD based on legacy measurement
let jittered = base_ms as f64 * GlobalRng::random_f64_range(0.8..1.2);
let duration = Duration::from_millis(jittered as u64);
crate::deterministic_select! {
_ = tokio::time::sleep(duration) => {},
}
}
```

If legacy's effective delay is negligible: document "intentional no-delay" in the driver's module doc with a measurement citation, and add a `test_phase_2b_subscribe_fanout_cap` integration test that asserts a single client's NotFound-contract subscribe does not exceed some sensible per-second cap.

Scope

Small, isolated — one function to modify, one benchmark/measurement to gather first.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    A-developer-xpArea: developer experienceA-networkingArea: Networking, ring protocol, peer discoveryE-mediumExperience needed to fix/implement: Medium / intermediateS-needs-designStatus: Needs architectural design or RFCT-enhancementType: Improvement to existing functionality

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions