Context
Surfaced by the code-first review of PR #3806 (#1454 Phase 2b), deferred with the note "worth confirming intentional."
The Phase 2b SUBSCRIBE driver in `operations/subscribe/op_ctx_task.rs::drive_client_subscribe_inner` goes straight from one failed attempt into the next `send_and_await` call with zero inter-attempt delay. That matches legacy in the sense that legacy also doesn't have explicit jitter between retries — but legacy serializes through the event loop, which provides incidental backoff: each retry has to wait for `notify_op_change` → `priority_select` → `handle_op_result` → `process_message` to cycle. Task-per-tx has no such incidental delay.
Under the right conditions — a contract that no peer is hosting, so every retry returns `NotFound` — a Phase 2b driver can burn through `MAX_BREADTH * MAX_RETRIES = 3 * 10 = 30` attempts essentially as fast as the network can deliver replies. That's a latent DoS vector against the topology-wide CPU/bandwidth budget for a single client subscribe.
Code-style rule
`.claude/rules/code-style.md` WHEN writing retry/backoff loops:
All retry/backoff loops MUST apply random jitter:
→ At least ±20% of the interval to prevent thundering herd
The rule presumes there IS a base interval. Phase 2b's driver has zero interval, so the rule isn't violated, but it isn't followed either — and the reviewer's concern is whether zero is actually the right choice.
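The ±20% jitter the rule asks for is cheap to compute. A std-only sketch, for illustration: `GlobalRng::random_f64_range` is the project's helper, so a tiny xorshift PRNG stands in here to keep the snippet self-contained (the PRNG, seed, and 50 ms base are all illustrative assumptions, not values from the codebase):

```rust
// Illustrative stand-in for the project's GlobalRng: a minimal xorshift64.
fn xorshift(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Jitter a base interval by ±20%, per the code-style rule.
fn jittered_ms(base_ms: u64, state: &mut u64) -> u64 {
    // Map the top 53 bits onto [0, 1), then scale into [0.8, 1.2).
    let unit = (xorshift(state) >> 11) as f64 / (1u64 << 53) as f64;
    (base_ms as f64 * (0.8 + 0.4 * unit)) as u64
}

fn main() {
    let mut state = 0x1234_5678_9abc_def0; // any nonzero seed
    for _ in 0..5 {
        let d = jittered_ms(50, &mut state);
        // Every draw stays within ±20% of the 50 ms base.
        assert!((40..=60).contains(&d));
        println!("{d}");
    }
}
```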
Investigation questions
- Is the current behavior a regression relative to legacy? Need to measure the effective inter-retry delay in the legacy path under realistic load — is it 5 ms? 50 ms? 500 ms? Task-per-tx's zero-delay behavior is fine only if legacy's effective delay is also in that ballpark.
- What's the right interval? If legacy's effective delay is ~50 ms, the direct equivalent would be a `TrackedBackoff` per subscribe with base 50 ms. But we probably don't want full `TrackedBackoff` per-tx — too much state. A simpler `GlobalRng::random_f64_range(base * 0.8..base * 1.2)` jittered `deterministic_select!` sleep is probably enough.
- Interruptible sleep requirement: per the code-style rule, "Backoff sleeps MUST be interruptible: Use tokio::select! to race sleep against cancellation signal." Phase 2b drivers can be aborted by the task being dropped, which cancels `.await` points — but if there's a sleep between attempts, that sleep must respect cancellation. In practice a plain `sleep(duration).await` inside a spawned task is interruptible (drop aborts it), but the spirit of the rule suggests racing against an explicit cancellation signal too.
- DST determinism: if we add sleeps, they must use `deterministic_select!` (like the existing wait path in `subscribe.rs::wait_for_contract_with_timeout`) so the simulation runner advances virtual time.
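The interruptibility requirement from the third bullet can be shown without tokio. A std-only sketch of the shape (the real driver would race `tokio::time::sleep` against a cancellation signal inside `deterministic_select!`; here `recv_timeout` on a cancel channel plays both roles, and all names are illustrative):

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Sleep for `backoff`, but wake immediately if a cancel message arrives.
/// Returns true if the full backoff elapsed, false if cancelled early.
fn interruptible_sleep(backoff: Duration, cancel: &mpsc::Receiver<()>) -> bool {
    match cancel.recv_timeout(backoff) {
        Err(mpsc::RecvTimeoutError::Timeout) => true, // slept the full interval
        _ => false, // cancelled (or the cancel sender was dropped)
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Cancel after ~10 ms while the sleeper waits up to 2 s.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        let _ = tx.send(());
    });
    let start = Instant::now();
    let completed = interruptible_sleep(Duration::from_secs(2), &rx);
    // The cancel signal cut the backoff short, well under the 2 s budget.
    assert!(!completed);
    assert!(start.elapsed() < Duration::from_secs(1));
}
```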
Proposed (subject to investigation)
If legacy's effective delay is measurable and non-trivial:
```rust
// Between attempts in drive_client_subscribe_inner:
if retries > 0 || attempts_at_hop > 1 {
    let base_ms = 20; // TBD based on legacy measurement
    let jittered = base_ms as f64 * GlobalRng::random_f64_range(0.8..1.2);
    let duration = Duration::from_millis(jittered as u64);
    crate::deterministic_select! {
        _ = tokio::time::sleep(duration) => {},
    }
}
```
If legacy's effective delay is negligible: document "intentional no-delay" in the driver's module doc with a measurement citation, and add a `test_phase_2b_subscribe_fanout_cap` integration test that asserts a single client's NotFound-contract subscribe does not exceed some sensible per-second cap.
Scope
Small, isolated — one function to modify, one benchmark/measurement to gather first.
Related