Skip to content

fix: schedule reconnect after parse error during streaming#134

Merged
kinyoklion merged 3 commits into
mainfrom
rl/sdk-2345/parser-error-reconnect-state
May 11, 2026
Merged

fix: schedule reconnect after parse error during streaming#134
kinyoklion merged 3 commits into
mainfrom
rl/sdk-2345/parser-error-reconnect-state

Conversation

@kinyoklion

@kinyoklion kinyoklion commented May 8, 2026

Copy link
Copy Markdown
Member

Summary

Every error path in ReconnectingRequest::poll_next during the Connected state sets state to WaitingToReconnect before returning the error — except for parser errors propagated through process_bytes(...)?. After a parse error, the state stayed at Connected and the next poll continued draining the (already-broken) response body before finally hitting end-of-body and emitting a spurious UnexpectedEof before the reconnect actually kicked in.

This PR schedules a reconnect on parse errors when reconnect is enabled, matching the pattern used by every other error path. The error is still returned to the caller; the only behavioral change is that the next poll cleanly transitions to reconnect rather than draining the broken body.

Before: Connected, Err(InvalidLine), Err(UnexpectedEof), Connected, …
After: Connected, Err(InvalidLine), Connected, …

Context

Surfaced while investigating launchdarkly/rust-server-sdk#116 (StreamingDataSource silently shuts down on stream errors). That report has two contributing bugs; this PR addresses the rust-eventsource-client side. The rust-server-sdk side will be fixed in a follow-up PR — together they restore the reconnection contract: the eventsource-client owns reconnection, the SDK keeps polling and trusts it.

Tracked in SDK-2345.

Test plan

  • New test parser_error_schedules_reconnect_immediately (eventsource-client/src/client.rs) — verifies the next stream item after a parse error is Connected from the reconnect, not a spurious UnexpectedEof.
  • cargo test — 60 lib tests + 2 doc tests pass; no regressions.

Note

Medium Risk
Changes streaming error-handling semantics by scheduling reconnects on parser failures, which can affect how downstream consumers observe errors and reconnect timing. Scope is small and covered by a new integration-style test, but touches core stream state transitions.

Overview
Fixes reconnection behavior after SSE parse errors during streaming. When EventParser::process_bytes fails in the Connected state, the client now (when ReconnectOptions::reconnect is enabled) transitions to WaitingToReconnect before yielding the parse error, avoiding continued draining of a broken response body and the resulting follow-on errors.

Updates Client::stream docs to clarify that stream errors are non-terminal and that the stream only ends on Poll::Ready(None), and adds a new test (parser_error_schedules_reconnect_immediately) asserting the sequence Connected -> Err(InvalidLine) -> Connected.

Reviewed by Cursor Bugbot for commit ddea87d. Bugbot is set up for automated code reviews on this repo. Configure here.

In `ReconnectingRequest::poll_next`, parser errors propagated through
`process_bytes(...)?` while in the `Connected` state previously
returned the error without scheduling a reconnect. The next poll
continued draining the broken response body before finally hitting
end-of-body and emitting a spurious `UnexpectedEof` before the
reconnect kicked in.

Schedule a reconnect on parse errors when reconnect is enabled. The
error is still returned to the caller; the only behavioral change is
that the next poll cleanly transitions to reconnect rather than
draining the broken body.

Adds `parser_error_schedules_reconnect_immediately` test verifying
that the next stream item after a parse error is the reconnect's
`Connected`, not an extra error.

Contributes to launchdarkly/rust-server-sdk#116.
@kinyoklion kinyoklion force-pushed the rl/sdk-2345/parser-error-reconnect-state branch from c7ae337 to 7917038 Compare May 8, 2026 21:32
@kinyoklion kinyoklion marked this pull request as ready for review May 8, 2026 21:39
@kinyoklion kinyoklion requested a review from a team as a code owner May 8, 2026 21:39
@kinyoklion kinyoklion changed the title fix: schedule reconnect after parse error during streaming [SDK-2345] fix: schedule reconnect after parse error during streaming May 8, 2026
- Match `Error::InvalidLine(_)` specifically so a regression that
  produces a different error class is caught.
- Replace `tokio::time::timeout(...).ok()` with `.expect(...)` so a
  flaky timeout produces a clear failure mode rather than a silent
  "got None" further down.
- Bump the timeout to 2 s to absorb slower CI runners.
- Drop the decorative `expect_at_least(2)` -- mockito's default
  `assert_on_drop` is false, so it enforced nothing; the
  `items[2] == Connected` assertion already requires the reconnect.
The previous docstring told consumers "Do not use the stream after it
returned an error!" That instruction contradicts how the stream
actually works: most errors are non-terminal, and the reconnect
machinery only runs if the consumer keeps polling. Stream
termination is signalled by `Poll::Ready(None)`, not by an error.

Reword the docstring to describe the real contract: errors are
informational, keep polling, the stream ends when poll_next returns
None.
@kinyoklion kinyoklion merged commit 2ac2998 into main May 11, 2026
10 checks passed
@kinyoklion kinyoklion deleted the rl/sdk-2345/parser-error-reconnect-state branch May 11, 2026 15:31
kinyoklion added a commit that referenced this pull request May 11, 2026
Three pre-existing reconnect-control gaps surfaced during the multi-
agent review of #134 (SDK-2347):

1. `BackoffRetry::change_base_delay` accepted any duration, including
   `Duration::ZERO`. A server emitting `retry: 0` collapsed the
   backoff to zero across every reconnect path, producing a tight
   reconnect loop. Clamp the input to a 1 ms floor.

2. The EOF arm of `ReconnectingRequest::poll_next` unconditionally
   scheduled a reconnect, even when `reconnect_opts.reconnect` was
   false. Honor the flag and transition to `StreamClosed` when
   reconnect is disabled, matching every other error path.

3. The parse-error arm only transitioned state when reconnect was
   enabled. With reconnect disabled, the parser stayed poisoned and
   the next poll drained to EOF, where (1) above papered over the
   bug. Transition to `StreamClosed` so the documented "do not use
   the stream after error" contract holds.

Tests:
- `test_change_base_delay_clamps_to_minimum` pins the retry floor.
- `parser_error_closes_stream_when_reconnect_disabled` asserts the
  stream returns `None` after a parse error when reconnect is off.
- `eof_closes_stream_when_reconnect_disabled` asserts the stream
  returns `None` after end-of-body when reconnect is off.
kinyoklion added a commit that referenced this pull request May 11, 2026
## Summary

Stacked on top of #134. Addresses three pre-existing reconnect-control
gaps surfaced during the multi-agent review of that PR.

1. **`retry: 0` collapses backoff to zero.** A server emitting `retry:
0` set `BackoffRetry::base_delay` to `Duration::ZERO`, after which every
reconnect path (`client.rs:474, 525, 547, 612` and the new parse-error
path) computed `next_delay() == 0` and reconnected immediately — a tight
loop. Clamps `change_base_delay` to a 1 ms floor.

2. **EOF arm ignored `reconnect_opts.reconnect`.** The body-exhausted
branch unconditionally scheduled `WaitingToReconnect` even when
reconnect was disabled. Now honors the flag and transitions to
`StreamClosed`, matching every other error path.

3. **Parse-error arm with `reconnect=false` left the parser poisoned.**
No state transition happened, so the next poll drained the broken body
to EOF — where (2) above papered over the bug. Now transitions to
`StreamClosed` so the documented \"do not use the stream after error\"
contract holds.

## Context

Stacked PR — base is the `rl/sdk-2345/parser-error-reconnect-state`
branch from #134, **not** main. Will retarget once #134 lands.

Tracked in
[SDK-2347](https://launchdarkly.atlassian.net/browse/SDK-2347).
Predecessor:
[SDK-2345](https://launchdarkly.atlassian.net/browse/SDK-2345) / #134.

Surfaced from the multi-agent review of #134 (findings 1, 2, and the
corresponding suggested follow-ups 4 and 5). All three were pre-existing
— #134 only made the parse-error path more visible.

## Test plan

- [x] `test_change_base_delay_clamps_to_minimum` (`retry.rs`) — pins the
1 ms floor against `change_base_delay(Duration::ZERO)` and a sub-floor
value.
- [x] `parser_error_closes_stream_when_reconnect_disabled` (`client.rs`)
— asserts the stream emits one `InvalidLine` then `None` when reconnect
is off.
- [x] `eof_closes_stream_when_reconnect_disabled` (`client.rs`) —
asserts the stream emits the event, `Eof`, then `None` when reconnect is
off.
- [x] Existing `parser_error_schedules_reconnect_immediately` from #134
still passes.
- [x] `cargo test` — 63 lib tests + 1 doc test pass.
- [x] `cargo fmt --check` clean.

[SDK-2347]:
https://launchdarkly.atlassian.net/browse/SDK-2347?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
kinyoklion pushed a commit that referenced this pull request May 11, 2026
🤖 I have created a release *beep* *boop*
---


##
[0.17.4](0.17.3...0.17.4)
(2026-05-11)


### Bug Fixes

* respect reconnect=false and clamp server-supplied retry: 0
([#135](#135))
([80c123a](80c123a))
* schedule reconnect after parse error during streaming
([#134](#134))
([2ac2998](2ac2998))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Low Risk**
> Low risk: this PR only updates release metadata (version bump and
changelog) and does not modify runtime code paths.
> 
> **Overview**
> Bumps `eventsource-client` from `0.17.3` to `0.17.4` in
`.release-please-manifest.json` and `eventsource-client/Cargo.toml`.
> 
> Updates `eventsource-client/CHANGELOG.md` with the `0.17.4` release
notes (reconnect/retry behavior fixes and reconnect scheduling after
parse errors).
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
cab73e7. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
kinyoklion added a commit to launchdarkly/rust-server-sdk that referenced this pull request May 11, 2026
## Summary

The `StreamingDataSource` fetch loop matched any error other than `Eof`
(and recoverable HTTP responses) with a catch-all `_ => break`, dropping
the spawned task and leaving the data source silently dysfunctional.
Because nothing was polling the eventsource-client's stream anymore, its
reconnect logic never ran. The `None` branch also broke without
notifying the caller.

This PR:

- Replaces the catch-all `break` with `continue`, so the
eventsource-client gets to reconnect on the next poll. A warning log
keeps transient errors visible.
- Calls `init_complete(false)` in the `None` branch before breaking, so
callers waiting on initialization get a signal when the stream is
permanently closed.

## Context

Surfaced from
[#116](#116).
That report has two contributing bugs; this PR addresses the
rust-server-sdk side. The matching rust-eventsource-client side is in
[launchdarkly/rust-eventsource-client#134](launchdarkly/rust-eventsource-client#134)
([SDK-2345](https://launchdarkly.atlassian.net/browse/SDK-2345)).
Together they restore the reconnection contract: the eventsource-client
owns reconnection, and the SDK keeps polling and trusts it.

Tracked in
[SDK-2346](https://launchdarkly.atlassian.net/browse/SDK-2346).

## Test plan

- [x] New test `streaming_source_recovers_from_non_eof_stream_error`
(`launchdarkly-server-sdk/src/data_source.rs`) — sends invalid UTF-8
that triggers `Error::InvalidLine` from the eventsource-client parser
and asserts the mock receives at least two requests, proving the SDK
kept polling and the underlying client reconnected.
- [x] `cargo test --package launchdarkly-server-sdk --lib` — 300 tests
pass; no regressions.
- [x] `cargo fmt --check` clean.

Closes #116.

[SDK-2345]:
https://launchdarkly.atlassian.net/browse/SDK-2345?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> **Medium Risk**
> Touches core streaming update loop behavior; a mistake could cause
retry storms or incorrect initialization signaling, though scope is
limited and covered by new tests.
> 
> **Overview**
> Fixes `StreamingDataSource` to **stay alive on non-EOF SSE errors** by
continuing the fetch loop (allowing the underlying eventsource client to
reconnect) and downgrading the failure to a warning.
> 
> When the event stream is permanently exhausted (`None`), it now
**signals initialization failure** via `init_complete(false)` before
exiting. Adds tests to verify reconnect attempts on parser errors and
stopping/signaling failure on unrecoverable HTTP statuses (e.g. `401`).
> 
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit
f1f3cc4. Bugbot is set up for automated
code reviews on this repo. Configure
[here](https://www.cursor.com/dashboard/bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants