Host/probe reliability fixes + pure view-fold refactor by paulocorcino · Pull Request #48 · paulocorcino/devtunnel_gui

paulocorcino · 2026-06-19T00:33:27Z

Bundles the host/probe reliability work on this branch with the view-layer refactor.

Highlights

Pure view-fold module (#42)

Extracts the view reconciliation out of rebuild_rows into a pure src/view.rs with one entry point, fold(&FoldInput) -> FoldOutput, free of any Slint / channel / Rc<RefCell> dependency. rebuild_rows becomes a thin adapter. 14 new table-driven tests cover probe badge mapping, optimistic-delete hiding, placeholder folding, the hosting pill, and detail-panel reconciliation. Zero behavior change.

Host / probe reliability (#35, #37, #38, #39, #44, #45, #47)

- Keep-alive driven by a pure, table-tested state machine (Keep-alive state machine extraction #35).
- Stop the reconnect loop on non-recoverable connect errors; forward each port's configured protocol (Stop reconnect loop on non-recoverable connect errors #43, shipped as 7aa2021).
- Reuse host/manage tokens across reconnects instead of re-minting; mint tokens concurrently (Measure the 20h re-mint blip #38, Reuse host/manage tokens across reconnects instead of re-minting every attempt #47).
- Probe-down watchdog policy + zombie-tunnel signature instrumentation (Zombie-tunnel evidence gate (probe vs engine Hosting) #37, Probe→reconnect watchdog #39).
- Surface connect sub-phases; fetch only the hosted tunnel's ports via a single show (Cache collect_ports from the create step instead of a fresh list/show per connect #44, Emit a Connecting sub-progress so a 14-18s connect does not read as hung #45).
- Headless host runner + blackbox resilience suite (e2e).

Verification

- cargo build and cargo build --features hosting: both pass.
- cargo test: 93 (default) / 102 (hosting) pass.
- cargo clippy + cargo fmt --check: clean on both feature sets.
- e2e: GUI loaded a live tunnel through the new fold path and rendered without panic.

🤖 Generated with Claude Code

Extract the host engine's keep-alive policy (reconnect backoff, token re-mint timing, auth-error relogin path) into a pure, dependency-free state machine in src/host/keepalive.rs. Declared unconditionally so its tests run without the vendored-OpenSSL toolchain. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Cover backoff progression (2,4,8,16,32,60,60) and reset-on-success, re-mint scheduling, auth-error to relogin, and reconnect phase change. Verified RED against a todo!() next() then GREEN once implemented. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
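The backoff progression quoted above (2,4,8,16,32,60,60, reset on success) can be sketched as a doubling delay with a cap. This is a minimal illustration, not the code in src/host/keepalive.rs: the `Backoff` type, its method names, and the two constants are hypothetical, and the base and cap are inferred from the quoted sequence.

```rust
use std::time::Duration;

/// Hypothetical constants inferred from the progression 2,4,8,16,32,60,60.
const BASE_SECS: u64 = 2;
const MAX_SECS: u64 = 60;

/// Counts consecutive failures and derives the next reconnect delay.
#[derive(Debug, Default)]
pub struct Backoff {
    failures: u32,
}

impl Backoff {
    /// Record a failed attempt and return how long to wait before retrying.
    pub fn on_failure(&mut self) -> Duration {
        // Clamp the exponent so the shift can never overflow; 2 << 6 = 128
        // already exceeds the cap, so larger exponents add nothing.
        let secs = (BASE_SECS << self.failures.min(6)).min(MAX_SECS);
        self.failures = self.failures.saturating_add(1);
        Duration::from_secs(secs)
    }

    /// A successful connect resets the progression to its first step.
    pub fn on_success(&mut self) {
        self.failures = 0;
    }
}
```

Keeping the arithmetic in a type with no I/O is what makes a table-driven test of the whole progression possible without a network or a clock.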

Rewrite host_group as a thin driver around KeepAliveState: it maps the Phase to Connecting/Reconnecting, feeds connection outcomes as ConnEvents, and executes the returned Action. All policy constants and backoff arithmetic are removed from engine.rs. The _host lifetime invariant (must stay bound across the keep-alive select! to avoid the busy-loop) is preserved and documented inline. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Pre-existing formatting deviations normalized by `cargo fmt` so the `cargo fmt --check` gate stays green. No behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…nect errors

The host engine forwarded every port as `http`, ignoring the configured protocol. A port created as `https`/`auto` was rejected by the service with `400 "the tunnel port protocol cannot be changed"`, and the keep-alive loop retried forever (re-minting tokens every cycle), never reaching `Hosting`. Only `http` ports could be hosted.

- connect_once: register each port with its configured protocol (fallback `auto` when absent); collect_ports now carries `(port, protocol)`, threaded through spawn_group -> host_group -> connect_once.
- Harden against non-recoverable failures: classify connect errors as Auth / Fatal / Transient (devtunnel::is_fatal_connect_error) in the pure keep-alive state machine (new ConnFailure enum + Action::Fail); a fatal error now surfaces HostState::Error and stops instead of an endless backoff loop. Completes the #35 keep-alive driver this builds on.

Verified end-to-end against the live service: http/https/auto all reach Hosting (https needs a TLS backend to serve); no regression on the http happy path or resilience. cargo test (75, incl. new fatal-path test), fmt, and clippy (default + --features hosting) clean.

Closes #36

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
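The Auth / Fatal / Transient split described above might look roughly like the sketch below. `ConnFailure` and `Action::Fail` are named in the commit message; everything else is an assumption made for illustration: the string-matching rules in `classify`, the `Relogin` and `RetryAfterBackoff` variants, and the `action_for` helper. The real devtunnel::is_fatal_connect_error may inspect errors quite differently.

```rust
/// Failure classes named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnFailure {
    Auth,
    Fatal,
    Transient,
}

/// `Fail` comes from the commit message; the other variants are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Relogin,
    Fail,
    RetryAfterBackoff,
}

/// Hypothetical classifier: the matching rules are assumptions, chosen only
/// so that the 400 quoted in the commit message lands in `Fatal`.
pub fn classify(message: &str) -> ConnFailure {
    let m = message.to_ascii_lowercase();
    if m.contains("protocol cannot be changed") {
        ConnFailure::Fatal
    } else if m.contains("401") || m.contains("unauthorized") {
        ConnFailure::Auth
    } else {
        ConnFailure::Transient
    }
}

/// Only transient failures re-enter the backoff loop; fatal ones stop it.
pub fn action_for(failure: ConnFailure) -> Action {
    match failure {
        ConnFailure::Auth => Action::Relogin,
        ConnFailure::Fatal => Action::Fail,
        ConnFailure::Transient => Action::RetryAfterBackoff,
    }
}
```

The point of the third class is that a request the service will never accept must not be retried: without `Fatal`, the 400 above is indistinguishable from a network blip and the loop never ends.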

The tray GUI can't be scripted, but its hosting engine is the product's value. Add a headless entrypoint (DEVTUNNEL_HEADLESS_HOST=<id,...>) that drives the production path (host::spawn -> engine::host_group -> keep-alive state machine) and streams every HostEvent as JSON on stdout, returning before any UI is built. Real engine only under --features hosting.

tests/e2e/ is a Python blackbox suite that uses the product as a user would: creates groups on a shared local port, hosts them through the headless engine, serves a real backend, and runs resilience scenarios while sampling the host process:

- S2 multiple groups, same port
- S3 sustained load + latency + idle/loaded host CPU & RSS (busy-loop watch)
- S1 reconnect after drop (stop->rehost proxy; real relay drop when elevated)
- S4 auto-resume after process kill

Emits report.md/json (gitignored) with a thresholded findings section.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…utor (#38)

connect_once minted the two tokens sequentially with blocking subprocess calls on the group's current-thread runtime. That doubled the mint wait and, during a periodic re-mint, stalled the still-live relay + port-forward tasks sharing the executor -- widening the very outage the re-mint exists to avoid.

Mint each token on its own spawn_blocking thread and overlap them with try_join!, so the round-trips run in parallel and the old connection keeps forwarding while new tokens mint. Cuts initial connect time and shrinks the re-mint blip without overlapping two live relay connections (which would need live validation of two-simultaneous-hosts behavior -- left as follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
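The commit overlaps the two mints with spawn_blocking and try_join! on the async runtime. The sketch below swaps in a std-only analogue (scoped threads) so it runs without any async dependency; it shows the same idea of two blocking round-trips running in parallel, but it is not the engine's code. `Tokens`, `mint_both`, and the closure parameters are hypothetical names, and unlike try_join! this version always waits for both mints before reporting an error.

```rust
use std::thread;

/// Hypothetical pair of minted tokens.
#[derive(Debug, PartialEq, Eq)]
pub struct Tokens {
    pub host: String,
    pub manage: String,
}

/// Runs both blocking mint calls on their own threads and waits for both,
/// so the total wait is roughly the slower call rather than the sum.
pub fn mint_both<H, M>(mint_host: H, mint_manage: M) -> Result<Tokens, String>
where
    H: FnOnce() -> Result<String, String> + Send,
    M: FnOnce() -> Result<String, String> + Send,
{
    let (host, manage) = thread::scope(|s| {
        let host = s.spawn(mint_host);
        let manage = s.spawn(mint_manage);
        // Both threads are already running; joining only collects results.
        (host.join(), manage.join())
    });
    // Outer error: the thread panicked. Inner error: the mint itself failed.
    let host = host.map_err(|_| "host mint panicked".to_string())??;
    let manage = manage.map_err(|_| "manage mint panicked".to_string())??;
    Ok(Tokens { host, manage })
}
```

In the engine the second benefit matters as much as the first: moving the blocking calls off the current-thread executor is what lets the live relay and port-forward tasks keep running during a re-mint.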

collect_ports called fetch_rows, which enumerates the whole account: a `devtunnel list` plus a `devtunnel show` for *every* tunnel, then discards all but the one being hosted. Hosting one tunnel therefore cost 1 + N subprocess round-trips (N = total tunnels), run serially before the relay handshake -- and the live E2E showed this, not the handshake, dominated the ~14-18s connect/resume time.

Replace it with a targeted `fetch_tunnel_ports`: one `devtunnel show <id> -j` for just the hosted tunnel, mapped to (port, protocol) by a pure, unit-tested helper (protocol preserved per #36). Account size no longer affects connect time.

Measured on the blackbox E2E (live brs cluster): connect to Hosting ~14-18s -> ~2-5s, stop->rehost ~16.5s -> ~4.4s, cold recover ~16s -> ~1.4-4.9s; serving True, error rate 0, host CPU/RSS unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
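A pure mapping helper of the kind mentioned above could be as small as the sketch below. It assumes the CLI output has already been deserialized into a hypothetical `RawPort` struct (the actual JSON shape of `devtunnel show <id> -j` is not given here), and it applies the `auto` fallback for an absent protocol described for connect_once. Treating an empty string as absent is an extra assumption.

```rust
/// Hypothetical, already-deserialized port entry; the field names are
/// assumptions, not the CLI's JSON schema.
#[derive(Debug, Clone)]
pub struct RawPort {
    pub port_number: u16,
    pub protocol: Option<String>,
}

/// Maps raw entries to (port, protocol), falling back to "auto" when the
/// protocol is missing (or, by assumption, empty).
pub fn to_port_specs(raw: &[RawPort]) -> Vec<(u16, String)> {
    raw.iter()
        .map(|p| {
            let protocol = p
                .protocol
                .as_deref()
                .filter(|s| !s.is_empty())
                .unwrap_or("auto");
            (p.port_number, protocol.to_string())
        })
        .collect()
}
```

Because the helper takes plain data and returns plain data, it can be unit-tested without spawning the CLI, which is the property the commit relies on.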

…ss (#45)

A connect spends most of its time in three phases -- minting tokens, the relay handshake, and forwarding ports -- but reported only one static "Connecting" label, making a multi-second wait indistinguishable from a hang.

Add an additive HostEvent::Progress { phase } emitted by connect_once at each phase boundary. The coarse Connecting/Hosting state transitions are unchanged, so the headless JSON contract the E2E depends on is preserved (the new "progress" line is additive). The GUI maps each phase to a Fluent status-bar string (status-connect-*); the headless runner serializes it as an additive "progress" event.

Verified live: the stream now interleaves Connecting -> progress(authorizing) -> progress(connecting_relay) -> progress(forwarding_ports) -> Hosting, which also shows token minting (~1.9s) is now the dominant connect cost after the #44 port-fetch fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
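As a rough picture of the additive event, the sketch below models the three phases and one possible line-per-event serialization. `HostEvent::Progress { phase }` and the phase names authorizing / connecting_relay / forwarding_ports come from the commit message; the `ConnectPhase` type name, the reduced `HostEvent` enum, and the JSON field names (`event`, `phase`) are assumptions, since the headless runner's real schema is not shown.

```rust
/// Connect sub-phases, in the order the commit message lists them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectPhase {
    Authorizing,
    ConnectingRelay,
    ForwardingPorts,
}

impl ConnectPhase {
    /// Name as it appears in the verified event stream.
    pub fn wire_name(self) -> &'static str {
        match self {
            ConnectPhase::Authorizing => "authorizing",
            ConnectPhase::ConnectingRelay => "connecting_relay",
            ConnectPhase::ForwardingPorts => "forwarding_ports",
        }
    }
}

/// Reduced, hypothetical event set: just enough to show that Progress is
/// additive next to the unchanged coarse states.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HostEvent {
    Connecting,
    Progress { phase: ConnectPhase },
    Hosting,
}

/// One JSON object per line; the field names are assumptions.
pub fn to_json_line(event: &HostEvent) -> String {
    match event {
        HostEvent::Connecting => r#"{"event":"connecting"}"#.to_string(),
        HostEvent::Hosting => r#"{"event":"hosting"}"#.to_string(),
        HostEvent::Progress { phase } => format!(
            r#"{{"event":"progress","phase":"{}"}}"#,
            phase.wire_name()
        ),
    }
}
```

A consumer that only matches on the coarse events keeps working unchanged, because progress lines form a new event kind rather than a change to an existing one.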

…minting (#47)

After #44, token minting (~1.9s of two `devtunnel token` subprocess round-trips) is the dominant connect cost -- and connect_once re-minted on every attempt, including relay-drop reconnects where the previous tokens are still valid (~24h lifetime; the engine already re-mints proactively at 20h).

Cache the minted (host, manage) pair driver-side in host_group and reuse it:

- relay-drop reconnect -> reuse cached tokens (skip the mint and the `Authorizing` phase);
- RemintDue (~20h) -> clear the cache and mint fresh before expiry;
- connect failure -> cache already taken and not restored, so the next attempt re-mints (no stale-token reuse loop).

No expiry parsing needed: the 20h re-mint timer bounds reuse well inside the ~24h validity. mint_tokens is split out of connect_once, which now takes an Option<Tokens> and returns the tokens used so the caller can cache them. The _host busy-loop invariant is unchanged.

Live: first connect still mints + serves (Connecting -> authorizing -> connecting_relay -> forwarding_ports -> Hosting). The in-session relay-drop reuse path needs an elevated firewall block to force (same S1b limitation the E2E documents); reviewed by inspection.

Gates: cargo test (76), clippy default + --features hosting, fmt --check -- all green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
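The three cache rules above fall out of a single `Option::take` at the start of each attempt. The sketch below illustrates that shape under stated assumptions: `attempt`, `on_remint_due`, and the closure parameters are hypothetical stand-ins for host_group, mint_tokens, and connect_once, whose real signatures are only partly described here.

```rust
/// Hypothetical token pair held by the driver between attempts.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Tokens {
    pub host: String,
    pub manage: String,
}

/// One connect attempt. Taking the cached tokens empties the cache up
/// front, so any early return below leaves it empty and the next attempt
/// mints fresh tokens instead of reusing possibly stale ones.
pub fn attempt<M, C>(cache: &mut Option<Tokens>, mint: M, connect: C) -> Result<(), String>
where
    M: FnOnce() -> Result<Tokens, String>,
    C: FnOnce(&Tokens) -> Result<(), String>,
{
    let tokens = match cache.take() {
        Some(cached) => cached, // reuse: skips the mint entirely
        None => mint()?,
    };
    connect(&tokens)?;
    // Only a successful connect puts tokens back for later reuse.
    *cache = Some(tokens);
    Ok(())
}

/// The periodic re-mint clears the cache so the next attempt mints fresh.
pub fn on_remint_due(cache: &mut Option<Tokens>) {
    *cache = None;
}
```

Restoring the cache only on success is the design choice that rules out a loop in which a rejected token is retried forever.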

…euse (#47)

A genuine in-session relay drop (the path that exercises #47's token reuse) could only be forced with an elevated firewall block, which is slow and flaky: an outbound block does not sever the established relay socket until a long keepalive timeout, and a held block makes the reconnect attempts fail (which by design clears the cache and re-mints), so it never cleanly demonstrates reuse.

Add a HostCommand::DropRelay that signals a per-group Notify raced in the keep-alive select!, producing a RelayDropped without tearing the group down. The headless runner exposes it as a `drop <id>` stdin command. This forces a deterministic reconnect with no network outage, firewall, or admin.

Verified reuse with it (non-elevated): after `drop`, the reconnect goes straight to connecting_relay with NO `authorizing` phase and reaches Hosting in ~0.5s (vs ~2.4s on first connect) -- the relay accepts the reused token and the ~1.9s mint is skipped. Closes the open verification item on #47.

Gates: cargo test (76), clippy default + --features hosting, fmt --check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…chine (#39)

Bridge the public-URL health probe into the keep-alive policy so a zombie tunnel (relay session the SDK still believes live, but whose public URL is dead) forces a reconnect instead of hanging in `Hosting` forever.

This commit lands the *pure, unit-tested* half of #39: the policy. The driver wiring (feeding probe ticks into the engine's keep-alive `select!`) stays out, gated on the #37 zombie-tunnel go-decision -- `Action::Reconnect` is therefore never emitted by `engine.rs` yet, only handled.

- `ProbeOutcome { Healthy, Down, ServiceDown }` and `ConnEvent::Probe(_)`: the streak is counted inside the state machine so the false-positive guard is pure and testable. Only a `Down` streak reaching `PROBE_DOWN_THRESHOLD` (3) on a live `Hosting` session yields `Action::Reconnect`; `ServiceDown` (relay alive, local upstream down -- e.g. a server restart) never triggers, per the #39 acceptance criterion.
- Probes before the first connect, or after a session-ending event, are absorbed as `Await` -- the watchdog only arms between `Connected` and the next teardown.
- `Reconnect` reconnects immediately with no extra backoff, funnelling into the existing `connect_once` path (no parallel reconnect logic).

8 new state-machine tests cover the streak threshold, the ServiceDown guard, streak resets, the not-connected windows, and re-arming after a reconnect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
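The streak rule above can be sketched as a tiny state machine. `ProbeOutcome`, `PROBE_DOWN_THRESHOLD` (3), `Action::Reconnect`, and `Await` come from the commit message; the `Watchdog` type and its method names are hypothetical, and so are two behaviors the message does not spell out: here `ServiceDown` resets the `Down` streak, and emitting `Reconnect` disarms the watchdog until the next connect.

```rust
pub const PROBE_DOWN_THRESHOLD: u32 = 3;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeOutcome {
    Healthy,
    Down,
    ServiceDown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Await,
    Reconnect,
}

/// Hypothetical watchdog: armed only between a connect and the next teardown.
#[derive(Debug, Default)]
pub struct Watchdog {
    armed: bool,
    down_streak: u32,
}

impl Watchdog {
    pub fn on_connected(&mut self) {
        self.armed = true;
        self.down_streak = 0;
    }

    pub fn on_session_ended(&mut self) {
        self.armed = false;
        self.down_streak = 0;
    }

    pub fn on_probe(&mut self, outcome: ProbeOutcome) -> Action {
        if !self.armed {
            return Action::Await; // probes outside a live session are absorbed
        }
        match outcome {
            ProbeOutcome::Down => {
                self.down_streak += 1;
                if self.down_streak >= PROBE_DOWN_THRESHOLD {
                    // Assumption: disarm until the reconnect completes.
                    self.on_session_ended();
                    Action::Reconnect
                } else {
                    Action::Await
                }
            }
            // Assumption: ServiceDown resets the streak, as Healthy does.
            ProbeOutcome::Healthy | ProbeOutcome::ServiceDown => {
                self.down_streak = 0;
                Action::Await
            }
        }
    }
}
```

Counting the streak inside the policy rather than in the probe keeps the false-positive guard testable with plain event sequences.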

)

The probe could not see a zombie tunnel. `combine` deliberately reports a Public-URL network error as `Operational` while the local port is listening (a transient WAN hiccup is not a service outage), so the exact zombie state -- local upstream fine, Public URL dead, the SDK's `RelayHandle` never resolving so the engine stays `Hosting` -- was invisible to every layer. The probe's `Down` is only ever set by the engine's `RelayHandle`, which in a zombie never fires. The signal #39's watchdog needs did not exist yet.

Surface it without changing the badge: when the slow HTTP fallback finds the Public URL unreachable while the local port is up, the probe emits a new `ProbeEvent::PublicUnreachable`. The wiring layer logs it at WARN only when the engine still believes that group is `Hosting` (the full zombie signature), and at DEBUG otherwise (an ordinary drop the engine is already reconnecting).

This is the lightweight instrumentation of #37: pure observability, no behaviour change. The recorded occurrences over real-use hosting feed the #37 go/no-go decision and, once that gate opens, the #39 reconnect bridge (whose pure policy already landed in keepalive.rs).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
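The two decisions described above (when to emit the signal, and at which level to log it) reduce to two small pure functions. This sketch is illustrative only: `EngineView`, `Level`, and both function names are hypothetical, and the real probe and wiring layer carry more state than two booleans.

```rust
/// Hypothetical, reduced view of what the engine believes about a group.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EngineView {
    Hosting,
    NotHosting,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Warn,
    Debug,
}

/// The signal fires only when the local upstream is fine but the public
/// URL is dead; a dead local port is a different condition.
pub fn is_public_unreachable(local_port_up: bool, public_url_reachable: bool) -> bool {
    local_port_up && !public_url_reachable
}

/// WARN only for the full zombie signature: unreachable public URL while
/// the engine still believes it is hosting. Otherwise the engine is
/// presumably already reconnecting, so DEBUG is enough.
pub fn level_for(engine: EngineView) -> Level {
    match engine {
        EngineView::Hosting => Level::Warn,
        EngineView::NotHosting => Level::Debug,
    }
}
```

Neither function touches the badge, which matches the stated goal of adding observability without changing behavior.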

Move the view reconciliation logic out of `rebuild_rows` into a new pure `src/view.rs` with one entry point, `fold(&FoldInput) -> FoldOutput`. The four sources of truth (CLI rows, probe results, host state, optimistic delete/placeholder sets) are now merged in a module free of any Slint, channel, or `Rc<RefCell>` dependency, returning plain `GroupViewData` / `PortViewData`.

`rebuild_rows` becomes a thin adapter: feed inputs, map the plain result onto Slint structs, rebuild the tray menu, set props. `derive_status`, `derive_host_state`, the `Placeholder` struct, and `PROVISIONING_STATUS` move into the module.

Adds 14 table-driven tests covering badge mapping for the 3 probe states, optimistic-delete hiding (single port / whole group / last-port-portless), placeholder folding, the hosting pill, and detail-panel reconciliation. Zero behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
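To make the shape of such a pure fold concrete, here is a heavily reduced sketch covering only optimistic-delete hiding and the hosting flag. The names `fold`, `FoldInput`, `FoldOutput`, `GroupViewData`, and `PortViewData` come from the commit message, but every field below is an assumption, and probe badges, placeholders, and the detail panel are left out. It also assumes that deleting a group's last port leaves the group visible with no ports.

```rust
use std::collections::HashSet;

/// Reduced, hypothetical inputs: rows from the CLI plus optimistic state.
pub struct FoldInput {
    pub rows: Vec<(String, Vec<u16>)>,
    pub deleted_groups: HashSet<String>,
    pub deleted_ports: HashSet<(String, u16)>,
    pub hosting: HashSet<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub struct PortViewData {
    pub port: u16,
}

#[derive(Debug, PartialEq, Eq)]
pub struct GroupViewData {
    pub id: String,
    pub hosting: bool,
    pub ports: Vec<PortViewData>,
}

#[derive(Debug, PartialEq, Eq)]
pub struct FoldOutput {
    pub groups: Vec<GroupViewData>,
}

/// Pure merge: no UI types, no channels, no shared mutable state.
pub fn fold(input: &FoldInput) -> FoldOutput {
    let groups = input
        .rows
        .iter()
        // Hide whole groups that are optimistically deleted.
        .filter(|(id, _)| !input.deleted_groups.contains(id))
        .map(|(id, ports)| GroupViewData {
            id: id.clone(),
            hosting: input.hosting.contains(id),
            ports: ports
                .iter()
                .copied()
                // Hide single ports that are optimistically deleted.
                .filter(|p| !input.deleted_ports.contains(&(id.clone(), *p)))
                .map(|port| PortViewData { port })
                .collect(),
        })
        .collect();
    FoldOutput { groups }
}
```

An adapter in the style of `rebuild_rows` would then only translate `FoldOutput` into UI structs, so all reconciliation rules stay testable with plain values.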

paulocorcino and others added 14 commits June 18, 2026 04:28

style: apply rustfmt to devtunnel.rs

7bd333f

paulocorcino merged commit 9a447bb into main Jun 19, 2026
2 checks passed

paulocorcino merged 14 commits into main from feat/fixes-202606
