Context
Surfaced by the code-first review of PR #3806 (#1454 Phase 2b), deferred with the note "worth confirming intentional."
The Phase 2b SUBSCRIBE driver in `operations/subscribe/op_ctx_task.rs::drive_client_subscribe_inner` goes straight from one failed attempt into the next `send_and_await` call with zero inter-attempt delay. That matches legacy in the sense that legacy also doesn't have explicit jitter between retries — but legacy serializes through the event loop, which provides incidental backoff: each retry has to wait for `notify_op_change` → `priority_select` → `handle_op_result` → `process_message` to cycle. Task-per-tx has no such incidental delay.
Under the right conditions — a contract that no peer is hosting, so every retry returns `NotFound` — a Phase 2b driver can burn through `MAX_BREADTH * MAX_RETRIES = 3 * 10 = 30` attempts essentially as fast as the network can deliver replies. That's a latent DoS vector against the topology-wide CPU/bandwidth budget for a single client subscribe.
Code-style rule
`.claude/rules/code-style.md` WHEN writing retry/backoff loops:
All retry/backoff loops MUST apply random jitter:
→ At least ±20% of the interval to prevent thundering herd
The rule presumes there IS a base interval. Phase 2b's driver has zero interval, so the rule isn't violated, but it isn't followed either — and the reviewer's concern is whether zero is actually the right choice.
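The ±20% jitter the rule asks for is cheap to compute. A std-only sketch, for illustration: `GlobalRng::random_f64_range` is the project's helper, so a tiny xorshift PRNG stands in here to keep the snippet self-contained (the PRNG, seed, and 50 ms base are all illustrative assumptions, not values from the codebase):

```rust
// Illustrative stand-in for the project's GlobalRng: a minimal xorshift64.
fn xorshift(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Jitter a base interval by ±20%, per the code-style rule.
fn jittered_ms(base_ms: u64, state: &mut u64) -> u64 {
    // Map the top 53 bits onto [0, 1), then scale into [0.8, 1.2).
    let unit = (xorshift(state) >> 11) as f64 / (1u64 << 53) as f64;
    (base_ms as f64 * (0.8 + 0.4 * unit)) as u64
}

fn main() {
    let mut state = 0x1234_5678_9abc_def0; // any nonzero seed
    for _ in 0..5 {
        let d = jittered_ms(50, &mut state);
        // Every draw stays within ±20% of the 50 ms base.
        assert!((40..=60).contains(&d));
        println!("{d}");
    }
}
```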
Investigation questions
- Is the current behavior a regression relative to legacy? Need to measure the effective inter-retry delay in the legacy path under realistic load — is it 5 ms? 50 ms? 500 ms? Task-per-tx's zero-delay behavior is fine only if legacy's effective delay is also in that ballpark.
- What's the right interval? If legacy's effective delay is ~50 ms, the direct equivalent would be a `TrackedBackoff` per subscribe with base 50 ms. But we probably don't want full `TrackedBackoff` per-tx — too much state. A simpler `GlobalRng::random_f64_range(base * 0.8..base * 1.2)` jittered `deterministic_select!` sleep is probably enough.
- Interruptible sleep requirement: per the code-style rule, "Backoff sleeps MUST be interruptible: Use tokio::select! to race sleep against cancellation signal." Phase 2b drivers can be aborted by the task being dropped, which cancels `.await` points — but if there's a sleep between attempts, that sleep must respect cancellation. In practice a plain `sleep(duration).await` inside a spawned task is interruptible (drop aborts it), but the spirit of the rule suggests racing against an explicit cancellation signal too.
- DST determinism: if we add sleeps, they must use `deterministic_select!` (like the existing wait path in `subscribe.rs::wait_for_contract_with_timeout`) so the simulation runner advances virtual time.
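The interruptibility requirement from the third bullet can be shown without tokio. A std-only sketch of the shape (the real driver would race `tokio::time::sleep` against a cancellation signal inside `deterministic_select!`; here `recv_timeout` on a cancel channel plays both roles, and all names are illustrative):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Sleep for `backoff`, but wake immediately if a cancel message arrives.
/// Returns true if the full backoff elapsed, false if cancelled early.
fn interruptible_sleep(backoff: Duration, cancel: &mpsc::Receiver<()>) -> bool {
    match cancel.recv_timeout(backoff) {
        Err(mpsc::RecvTimeoutError::Timeout) => true, // slept the full interval
        _ => false, // cancelled (or the cancel sender was dropped)
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Cancel after ~10 ms while the sleeper waits up to 2 s.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        let _ = tx.send(());
    });
    let start = Instant::now();
    let completed = interruptible_sleep(Duration::from_secs(2), &rx);
    // The cancel signal cut the backoff short, well under the 2 s budget.
    assert!(!completed);
    assert!(start.elapsed() < Duration::from_secs(1));
}
```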
Proposed (subject to investigation)
If legacy's effective delay is measurable and non-trivial:
```rust
// Between attempts in drive_client_subscribe_inner:
if retries > 0 || attempts_at_hop > 1 {
    let base_ms = 20; // TBD based on legacy measurement
    let jittered = base_ms as f64 * GlobalRng::random_f64_range(0.8..1.2);
    let duration = Duration::from_millis(jittered as u64);
    crate::deterministic_select! {
        _ = tokio::time::sleep(duration) => {},
    }
}
```
If legacy's effective delay is negligible: document "intentional no-delay" in the driver's module doc with a measurement citation, and add a `test_phase_2b_subscribe_fanout_cap` integration test that asserts a single client's NotFound-contract subscribe does not exceed some sensible per-second cap.
Scope
Small, isolated — one function to modify, one benchmark/measurement to gather first.
Related