
Add retry with backoff for initial snapshot fetch failure#708

Open
iamnbutler wants to merge 1 commit into main from
tasks/gh-iamnbutler-tasks-issue-490--309ed925

Conversation

@iamnbutler
Owner

Summary

  • When the initial snapshot fetch fails (server not yet running), the desktop app now automatically retries with exponential backoff (1s → 2s → 4s → 8s → 16s → 30s cap) instead of showing an empty state forever
  • Dashboard shows connection-aware UI: "Connecting to server...", "Reconnecting to server..." with error details, instead of the unhelpful "No data yet" message
  • Added a manual "Retry" button visible during reconnecting/failed/disconnected states that resets the backoff and immediately retries
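The capped exponential schedule described above can be sketched as a pure function of the retry count. This is an illustrative sketch, not the PR's actual code: the constant names and the exact count-to-delay mapping are assumptions (the real implementation, excerpted later in the review, increments the count before computing the delay).

```rust
// Sketch of a capped exponential backoff schedule: 1s, 2s, 4s, 8s, 16s, then 30s.
// INITIAL_RETRY_DELAY_MS and MAX_RETRY_DELAY_MS are assumed names/values.
const INITIAL_RETRY_DELAY_MS: u64 = 1_000;
const MAX_RETRY_DELAY_MS: u64 = 30_000;

fn retry_delay_ms(retry_count: u32) -> u64 {
    // Clamp the shift so the multiplier can't overflow, then cap at 30s.
    INITIAL_RETRY_DELAY_MS
        .saturating_mul(1u64 << retry_count.min(5))
        .min(MAX_RETRY_DELAY_MS)
}

fn main() {
    let schedule: Vec<u64> = (0..7).map(retry_delay_ms).collect();
    println!("{:?}", schedule); // [1000, 2000, 4000, 8000, 16000, 30000, 30000]
}
```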

Test plan

  • Start the desktop app without the server running — should show "Connecting to server..." then "Reconnecting to server..." with error details and a Retry button
  • Click the Retry button — should reset to "Connecting to server..." and attempt a fresh connection
  • Start the server while the app is retrying — should connect and show the dashboard
  • Verify normal startup (server already running) still works as before

Closes #490

🤖 Generated with Claude Code

When the server isn't running at startup, the desktop app now retries
with exponential backoff (1s, 2s, 4s... up to 30s) instead of showing
a permanent empty state. The dashboard displays connection status
("Connecting...", "Reconnecting...") with error details and a manual
Retry button when in a failed/reconnecting state.

Closes #490

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Contributor

@github-actions github-actions bot left a comment


The PR adds exponential backoff for the initial snapshot fetch in the desktop app, with a "Connecting/Reconnecting" status display and a manual Retry button. The change is contained to crates/desktop/ (two files, ~120 lines added), the core logic is correct, and the UX improvements are well-targeted.

No important issues. Three suggestions are left inline.

Build notes:

  • cargo test --workspace: 1 failing test — server::tests::reject_merge_entry_closed_marks_task_completed — is pre-existing from the previous commit (d6f2fe1, #694), not caused by this PR. The test panics on an unwrap() in server.rs:3169.
  • cargo clippy --workspace -- -D warnings: 2 pre-existing errors in crates/models/ (floor_char_boundary MSRV and should_implement_trait), also not caused by this PR.
  • Desktop crate (crates/desktop/) is excluded from the workspace test run (GPUI dependencies), so the new retry logic has no automated test coverage — but that's a pre-existing gap in the test setup, not specific to this PR.

Summary of inline comments:

  1. Backoff redundant with polling loop (state.rs:411) — The 5s polling loop independently calls update_snapshot on failure and spawns its own retry timers, so the rate-limiting intent of the backoff doesn't hold in steady state. Suggestion only: the fetch_in_flight guard caps the actual HTTP rate regardless.
  2. Stale timers survive retry_connection() (state.rs:346) — Detached timers from old backoff cycles can fire after a manual Retry, briefly flipping status from Connecting back to Reconnecting. A generation counter would prevent this.
  3. Getter in wrong section (state.rs:337) — snapshot_retry_count() is placed in the // --- Setters --- block; belongs in // --- Getters ---.


if !had_snapshot {
    self.snapshot_retry_count += 1;
    let delay_ms = INITIAL_RETRY_DELAY_MS
        .saturating_mul(1 << self.snapshot_retry_count.min(5))
Contributor


[SUGGESTION]

Priority: Code Quality

The backoff timers are redundant with the 5s polling loop during the pre-connection phase. When both are active, each polling tick also calls update_snapshot and, on failure, spawns its own new retry timer. Over time this accumulates multiple concurrent timers all scheduled at the max 30s interval — each one that fires (and fails) schedules yet another 30s timer. The fetch_in_flight guard prevents duplicate HTTP requests, but the effective retry rate in steady state is dominated by the 5s polling, not the 30s backoff cap.

The retry timers do add value for the very first window (2s faster than polling's 5s), but the rate-limiting intent of the backoff doesn't hold once the polling loop joins in.

One cleaner approach: only schedule a retry from update_snapshot when snapshot_retry_count == 1 (i.e., skip the retry timer on polling failures, only use it for the very first failure). Or track a flag like retry_timer_pending to skip scheduling a new one when one is already active.
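The retry_timer_pending idea above can be sketched as a boolean guard so that at most one retry timer is ever scheduled, regardless of how many failures the polling loop reports. A minimal sketch, with GPUI spawning elided to comments; AppState, on_fetch_failed, and on_retry_timer_fired are assumed names, not the PR's actual code:

```rust
// Sketch: a boolean guard so at most one backoff timer is pending at a time.
struct AppState {
    retry_timer_pending: bool,
    snapshot_retry_count: u32,
}

impl AppState {
    /// Called when a snapshot fetch fails (from either the backoff path
    /// or the 5s polling loop).
    fn on_fetch_failed(&mut self) {
        self.snapshot_retry_count += 1;
        if self.retry_timer_pending {
            // A timer is already scheduled; don't stack another one.
            return;
        }
        self.retry_timer_pending = true;
        // ... spawn the backoff timer here (cx.spawn + smol::Timer in GPUI);
        // its callback should invoke on_retry_timer_fired.
    }

    /// Called when the scheduled backoff timer fires.
    fn on_retry_timer_fired(&mut self) {
        self.retry_timer_pending = false;
        // ... kick off the actual snapshot fetch.
    }
}
```

With this guard, polling-loop failures still bump the count (so the delay keeps growing) but never schedule duplicate timers, which addresses the accumulation described above.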


/// Manually retry connecting to the server (resets backoff).
pub fn retry_connection(&mut self, cx: &mut Context<Self>) {
    self.snapshot_retry_count = 0;
Contributor


[SUGGESTION]

Priority: Correctness

retry_connection() resets snapshot_retry_count to 0 and sets status to Connecting, but any previously-detached retry timers remain active. If a stale timer fires after this reset (while the manually-triggered fetch is still in-flight or has already completed), update_snapshot on that path will flip the status back to Reconnecting — causing a visible flicker from Connecting to Reconnecting shortly after the user clicks Retry.

Since GPUI tasks are spawned with .detach(), there's no cancellation handle. A simple mitigation is a generation counter:

retry_generation: u32,  // bumped in retry_connection()

Pass the current generation into the timer closure, and in the closure, bail out if the generation has changed:

cx.spawn(async move |this, cx| {
    smol::Timer::after(delay).await;
    this.update(cx, |state, cx| {
        if state.retry_generation == generation {
            state.refresh_snapshot(cx);
        }
    }).ok();
}).detach();

}
}

/// Get the current snapshot retry count.
Contributor


[SUGGESTION]

Priority: Code Quality

snapshot_retry_count() is a getter but it's placed in the // --- Setters --- section (between set_selected_project and // --- Actions ---). Move it up to join the other getters under // --- Getters ---.


Development

Successfully merging this pull request may close these issues.

Desktop app: no retry on initial snapshot failure — stuck on empty state
