Host/probe reliability fixes + pure view-fold refactor #48
Merged
Extract the host engine's keep-alive policy (reconnect backoff, token re-mint timing, auth-error relogin path) into a pure, dependency-free state machine in src/host/keepalive.rs. Declared unconditionally so its tests run without the vendored-OpenSSL toolchain. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cover backoff progression (2,4,8,16,32,60,60) and reset-on-success, re-mint scheduling, auth-error to relogin, and reconnect phase change. Verified RED against a todo!() next() then GREEN once implemented. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
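The backoff progression those tests cover can be sketched as a small pure type. This is a minimal illustration, not the code in src/host/keepalive.rs: the `Backoff` struct, its method names, and the three constants are hypothetical, chosen only to reproduce the documented 2,4,8,16,32,60,60 sequence and the reset-on-success behavior.

```rust
use std::time::Duration;

// Hypothetical constants: chosen to reproduce the documented progression.
const BASE_SECS: u64 = 2;
const CAP_SECS: u64 = 60;
// 2 << 5 = 64 already exceeds the cap, so larger shifts are never needed.
const MAX_SHIFT: u32 = 5;

/// Hypothetical stand-in for the backoff part of the keep-alive state.
#[derive(Debug, Default)]
pub struct Backoff {
    /// Consecutive failures since the last successful connect.
    failures: u32,
}

impl Backoff {
    /// Record a failed attempt and return the delay before the next retry.
    pub fn on_failure(&mut self) -> Duration {
        // Clamp the exponent first so the shift can never overflow.
        let secs = (BASE_SECS << self.failures.min(MAX_SHIFT)).min(CAP_SECS);
        self.failures = self.failures.saturating_add(1);
        Duration::from_secs(secs)
    }

    /// A successful connect resets the progression to its first step.
    pub fn on_success(&mut self) {
        self.failures = 0;
    }
}
```

Because the type has no I/O or timer dependency, a test can drive it by calling `on_failure` repeatedly and comparing the returned durations, which is presumably what makes the RED/GREEN cycle described above cheap to run.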
Rewrite host_group as a thin driver around KeepAliveState: it maps the Phase to Connecting/Reconnecting, feeds connection outcomes as ConnEvents, and executes the returned Action. All policy constants and backoff arithmetic are removed from engine.rs. The _host lifetime invariant (must stay bound across the keep-alive select! to avoid the busy-loop) is preserved and documented inline. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pre-existing formatting deviations normalized by `cargo fmt` so the `cargo fmt --check` gate stays green. No behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nect errors

The host engine forwarded every port as `http`, ignoring the configured protocol. A port created as `https`/`auto` was rejected by the service with `400 "the tunnel port protocol cannot be changed"`, and the keep-alive loop retried forever (re-minting tokens every cycle), never reaching `Hosting`. Only `http` ports could be hosted.

- connect_once: register each port with its configured protocol (fallback `auto` when absent); collect_ports now carries `(port, protocol)`, threaded through spawn_group -> host_group -> connect_once.
- Harden against non-recoverable failures: classify connect errors as Auth / Fatal / Transient (devtunnel::is_fatal_connect_error) in the pure keep-alive state machine (new ConnFailure enum + Action::Fail); a fatal error now surfaces HostState::Error and stops instead of an endless backoff loop. Completes the #35 keep-alive driver this builds on.

Verified end-to-end against the live service: http/https/auto all reach Hosting (https needs a TLS backend to serve); no regression on the http happy path or resilience. cargo test (75, incl. new fatal-path test), fmt, and clippy (default + --features hosting) clean.

Closes #36

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
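The Auth / Fatal / Transient split can be illustrated with a short sketch. Only `ConnFailure` and `Action::Fail` are names taken from the commit message; the `classify` helper, the `Relogin` and `RetryAfterSecs` variants, and the status-code rules (401/403 as Auth, the 400 protocol message as Fatal) are assumptions made for illustration and are not the actual `devtunnel::is_fatal_connect_error` logic.

```rust
/// Failure classes named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnFailure {
    Auth,
    Fatal,
    Transient,
}

/// `Fail` is from the commit message; the other variants are hypothetical.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Action {
    Relogin,
    Fail(String),
    RetryAfterSecs(u64),
}

/// Hypothetical classifier: the matching rules here are assumptions.
pub fn classify(status: Option<u16>, message: &str) -> ConnFailure {
    match status {
        // Assumed: authentication/authorization statuses need a relogin.
        Some(401) | Some(403) => ConnFailure::Auth,
        // The one fatal case the commit message describes.
        Some(400) if message.contains("protocol cannot be changed") => ConnFailure::Fatal,
        // Everything else is treated as retryable.
        _ => ConnFailure::Transient,
    }
}

/// Pure policy step: map a classified failure to what the driver should do.
pub fn on_failure(kind: ConnFailure, message: &str, backoff_secs: u64) -> Action {
    match kind {
        ConnFailure::Auth => Action::Relogin,
        // A fatal error stops the loop instead of scheduling another retry.
        ConnFailure::Fatal => Action::Fail(message.to_string()),
        ConnFailure::Transient => Action::RetryAfterSecs(backoff_secs),
    }
}
```

The point of the split is that only `Transient` feeds the backoff loop, so a permanently rejected request can no longer retry forever.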
The tray GUI can't be scripted, but its hosting engine is the product's value. Add a headless entrypoint (DEVTUNNEL_HEADLESS_HOST=<id,...>) that drives the production path (host::spawn -> engine::host_group -> keep-alive state machine) and streams every HostEvent as JSON on stdout, returning before any UI is built. Real engine only under --features hosting.

tests/e2e/ is a Python blackbox suite that uses the product as a user would: creates groups on a shared local port, hosts them through the headless engine, serves a real backend, and runs resilience scenarios while sampling the host process:

- S2 multiple groups, same port
- S3 sustained load + latency + idle/loaded host CPU & RSS (busy-loop watch)
- S1 reconnect after drop (stop->rehost proxy; real relay drop when elevated)
- S4 auto-resume after process kill

Emits report.md/json (gitignored) with a thresholded findings section.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…utor (#38) connect_once minted the two tokens sequentially with blocking subprocess calls on the group's current-thread runtime. That doubled the mint wait and, during a periodic re-mint, stalled the still-live relay + port-forward tasks sharing the executor -- widening the very outage the re-mint exists to avoid. Mint each token on its own spawn_blocking thread and overlap them with try_join!, so the round-trips run in parallel and the old connection keeps forwarding while new tokens mint. Cuts initial connect time and shrinks the re-mint blip without overlapping two live relay connections (which would need live validation of two-simultaneous-hosts behavior -- left as follow-up). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
collect_ports called fetch_rows, which enumerates the whole account: a `devtunnel list` plus a `devtunnel show` for *every* tunnel, then discards all but the one being hosted. Hosting one tunnel therefore cost 1 + N subprocess round-trips (N = total tunnels), run serially before the relay handshake -- and the live E2E showed this, not the handshake, dominated the ~14-18s connect/resume time. Replace it with a targeted `fetch_tunnel_ports`: one `devtunnel show <id> -j` for just the hosted tunnel, mapped to (port, protocol) by a pure, unit-tested helper (protocol preserved per #36). Account size no longer affects connect time. Measured on the blackbox E2E (live brs cluster): connect to Hosting ~14-18s -> ~2-5s, stop->rehost ~16.5s -> ~4.4s, cold recover ~16s -> ~1.4-4.9s; serving True, error rate 0, host CPU/RSS unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ss (#45) A connect spends most of its time in three phases -- minting tokens, the relay handshake, and forwarding ports -- but reported only one static "Connecting" label, making a multi-second wait indistinguishable from a hang. Add an additive HostEvent::Progress { phase } emitted by connect_once at each phase boundary. The coarse Connecting/Hosting state transitions are unchanged, so the headless JSON contract the E2E depends on is preserved (the new "progress" line is additive). The GUI maps each phase to a Fluent status-bar string (status-connect-*); the headless runner serializes it as an additive "progress" event. Verified live: the stream now interleaves Connecting -> progress(authorizing) -> progress(connecting_relay) -> progress(forwarding_ports) -> Hosting, which also shows token minting (~1.9s) is now the dominant connect cost after the #44 port-fetch fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…minting (#47)

After #44, token minting (~1.9s of two `devtunnel token` subprocess round-trips) is the dominant connect cost -- and connect_once re-minted on every attempt, including relay-drop reconnects where the previous tokens are still valid (~24h lifetime; the engine already re-mints proactively at 20h). Cache the minted (host, manage) pair driver-side in host_group and reuse it:

- relay-drop reconnect -> reuse cached tokens (skip the mint and the `Authorizing` phase);
- RemintDue (~20h) -> clear the cache and mint fresh before expiry;
- connect failure -> cache already taken and not restored, so the next attempt re-mints (no stale-token reuse loop).

No expiry parsing needed: the 20h re-mint timer bounds reuse well inside the ~24h validity. mint_tokens is split out of connect_once, which now takes an Option<Tokens> and returns the tokens used so the caller can cache them. The _host busy-loop invariant is unchanged.

Live: first connect still mints + serves (Connecting -> authorizing -> connecting_relay -> forwarding_ports -> Hosting). The in-session relay-drop reuse path needs an elevated firewall block to force (same S1b limitation the E2E documents); reviewed by inspection.

Gates: cargo test (76), clippy default + --features hosting, fmt --check -- all green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
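The take-and-restore caching described above can be sketched synchronously. This is an illustration under stated assumptions, not the engine code: the `Driver` struct, its `attempt` and `remint_due` methods, the `relay_ok` flag, and the closure-based mint are hypothetical; only the idea that `connect_once` takes an `Option<Tokens>` and returns the tokens it used comes from the commit message.

```rust
/// The (host, manage) token pair; field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Tokens {
    pub host: String,
    pub manage: String,
}

/// Simplified connect: mint only when no cached pair was handed in.
/// `relay_ok` is a hypothetical stand-in for the real relay handshake.
pub fn connect_once(
    cached: Option<Tokens>,
    mint: impl FnOnce() -> Tokens,
    relay_ok: bool,
) -> Result<Tokens, String> {
    let tokens = match cached {
        Some(t) => t,   // reuse: skips the mint round-trips
        None => mint(), // first connect, or cache was cleared
    };
    if relay_ok {
        Ok(tokens) // hand the pair back so the caller can cache it
    } else {
        Err("relay handshake failed".to_string())
    }
}

/// Hypothetical driver holding the cache, loosely mirroring host_group.
#[derive(Debug, Default)]
pub struct Driver {
    cache: Option<Tokens>,
    /// Number of mints performed, kept only so the sketch is observable.
    pub mints: u32,
}

impl Driver {
    /// One connect attempt; returns whether it succeeded.
    pub fn attempt(&mut self, relay_ok: bool) -> bool {
        // take() empties the cache, so a failure leaves nothing to reuse.
        let cached = self.cache.take();
        let mints = &mut self.mints;
        let result = connect_once(
            cached,
            || {
                *mints += 1;
                Tokens {
                    host: format!("host-{}", *mints),
                    manage: format!("manage-{}", *mints),
                }
            },
            relay_ok,
        );
        match result {
            Ok(tokens) => {
                self.cache = Some(tokens);
                true
            }
            Err(_) => false,
        }
    }

    /// Proactive re-mint: drop the cached pair before it can expire.
    pub fn remint_due(&mut self) {
        self.cache = None;
    }
}
```

Taking the cache before the attempt, rather than clearing it on failure, is what rules out a stale-token reuse loop: there is no code path on which a failed attempt puts the old pair back.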
…euse (#47) A genuine in-session relay drop (the path that exercises #47's token reuse) could only be forced with an elevated firewall block, which is slow and flaky: an outbound block does not sever the established relay socket until a long keepalive timeout, and a held block makes the reconnect attempts fail (which by design clears the cache and re-mints), so it never cleanly demonstrates reuse. Add a HostCommand::DropRelay that signals a per-group Notify raced in the keep-alive select!, producing a RelayDropped without tearing the group down. The headless runner exposes it as a `drop <id>` stdin command. This forces a deterministic reconnect with no network outage, firewall, or admin. Verified reuse with it (non-elevated): after `drop`, the reconnect goes straight to connecting_relay with NO `authorizing` phase and reaches Hosting in ~0.5s (vs ~2.4s on first connect) -- the relay accepts the reused token and the ~1.9s mint is skipped. Closes the open verification item on #47. Gates: cargo test (76), clippy default + --features hosting, fmt --check. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…chine (#39)

Bridge the public-URL health probe into the keep-alive policy so a zombie tunnel (relay session the SDK still believes live, but whose public URL is dead) forces a reconnect instead of hanging in `Hosting` forever.

This commit lands the *pure, unit-tested* half of #39: the policy. The driver wiring (feeding probe ticks into the engine's keep-alive `select!`) stays out, gated on the #37 zombie-tunnel go-decision, so `Action::Reconnect` is never emitted by `engine.rs` yet, only handled.

- `ProbeOutcome { Healthy, Down, ServiceDown }` and `ConnEvent::Probe(_)`: the streak is counted inside the state machine so the false-positive guard is pure and testable. Only a `Down` streak reaching `PROBE_DOWN_THRESHOLD` (3) on a live `Hosting` session yields `Action::Reconnect`; `ServiceDown` (relay alive, local upstream down, e.g. a server restart) never triggers, per the #39 acceptance criterion.
- Probes before the first connect, or after a session-ending event, are absorbed as `Await`: the watchdog only arms between `Connected` and the next teardown.
- `Reconnect` reconnects immediately with no extra backoff, funnelling into the existing `connect_once` path (no parallel reconnect logic).

8 new state-machine tests cover the streak threshold, the ServiceDown guard, streak resets, the not-connected windows, and re-arming after a reconnect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
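The streak rule can be sketched as a standalone state machine. `ProbeOutcome`, `PROBE_DOWN_THRESHOLD`, and the `Await` / `Reconnect` actions are names from the commit message; the `Watchdog` struct and its methods are hypothetical (the real policy lives inside the keep-alive state machine and is driven by `ConnEvent::Probe`), and having `ServiceDown` reset the streak is an assumption.

```rust
pub const PROBE_DOWN_THRESHOLD: u32 = 3;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeOutcome {
    Healthy,
    Down,
    ServiceDown,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Await,
    Reconnect,
}

/// Hypothetical standalone version of the probe watchdog.
#[derive(Debug, Default)]
pub struct Watchdog {
    /// True only between a successful connect and the next teardown.
    armed: bool,
    /// Consecutive `Down` probes seen on the current session.
    down_streak: u32,
}

impl Watchdog {
    pub fn on_connected(&mut self) {
        self.armed = true;
        self.down_streak = 0;
    }

    pub fn on_teardown(&mut self) {
        self.armed = false;
        self.down_streak = 0;
    }

    pub fn on_probe(&mut self, outcome: ProbeOutcome) -> Action {
        // Probes outside a live session are absorbed.
        if !self.armed {
            return Action::Await;
        }
        match outcome {
            ProbeOutcome::Down => {
                self.down_streak += 1;
                if self.down_streak >= PROBE_DOWN_THRESHOLD {
                    // The session is over; disarm until the next connect.
                    self.on_teardown();
                    Action::Reconnect
                } else {
                    Action::Await
                }
            }
            // Assumption: both non-Down outcomes break the streak.
            ProbeOutcome::Healthy | ProbeOutcome::ServiceDown => {
                self.down_streak = 0;
                Action::Await
            }
        }
    }
}
```

Counting the streak inside the pure type, rather than in the driver, is what lets the false-positive guard be tested without a network or a timer.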
) The probe could not see a zombie tunnel. `combine` deliberately reports a Public-URL network error as `Operational` while the local port is listening (a transient WAN hiccup is not a service outage), so the exact zombie state — local upstream fine, Public URL dead, the SDK's `RelayHandle` never resolving so the engine stays `Hosting` — was invisible to every layer. The probe's `Down` is only ever set by the engine's `RelayHandle`, which in a zombie never fires. The signal #39's watchdog needs did not exist yet. Surface it without changing the badge: when the slow HTTP fallback finds the Public URL unreachable while the local port is up, the probe emits a new `ProbeEvent::PublicUnreachable`. The wiring layer logs it at WARN only when the engine still believes that group is `Hosting` (the full zombie signature), and at DEBUG otherwise (an ordinary drop the engine is already reconnecting). This is the lightweight instrumentation of #37: pure observability, no behaviour change. The recorded occurrences over real-use hosting feed the #37 go/no-go decision and, once that gate opens, the #39 reconnect bridge (whose pure policy already landed in keepalive.rs). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Move the view reconciliation logic out of `rebuild_rows` into a new pure `src/view.rs` with one entry point, `fold(&FoldInput) -> FoldOutput`. The four sources of truth (CLI rows, probe results, host state, optimistic delete/placeholder sets) are now merged in a module free of any Slint, channel, or `Rc<RefCell>` dependency, returning plain `GroupViewData` / `PortViewData`. `rebuild_rows` becomes a thin adapter: feed inputs, map the plain result onto Slint structs, rebuild the tray menu, set props. `derive_status`, `derive_host_state`, the `Placeholder` struct, and `PROVISIONING_STATUS` move into the module. Adds 14 table-driven tests covering badge mapping for the 3 probe states, optimistic-delete hiding (single port / whole group / last-port-portless), placeholder folding, the hosting pill, and detail-panel reconciliation. Zero behavior change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
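The shape of such a pure fold can be sketched as follows. Only `fold`, `FoldInput`, `FoldOutput`, `GroupViewData`, and `PortViewData` are names from the commit message; every field, the `Probe` enum (in particular its `Unknown` variant), and the badge strings are hypothetical, and the sketch covers only badge mapping, single-port optimistic-delete hiding, and the hosting flag.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical probe states; the real module's three states may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Probe {
    Operational,
    Down,
    Unknown,
}

/// Plain inputs only: no UI, channel, or shared-mutability types.
pub struct FoldInput {
    /// CLI rows as (group id, ports).
    pub rows: Vec<(String, Vec<u16>)>,
    /// Latest probe result per (group id, port).
    pub probes: HashMap<(String, u16), Probe>,
    /// Groups the host engine currently reports as hosting.
    pub hosting: HashSet<String>,
    /// Ports the user has deleted but the CLI still lists.
    pub deleted_ports: HashSet<(String, u16)>,
}

#[derive(Debug, PartialEq, Eq)]
pub struct PortViewData {
    pub port: u16,
    pub badge: &'static str,
}

#[derive(Debug, PartialEq, Eq)]
pub struct GroupViewData {
    pub id: String,
    pub hosting: bool,
    pub ports: Vec<PortViewData>,
}

#[derive(Debug, PartialEq, Eq)]
pub struct FoldOutput {
    pub groups: Vec<GroupViewData>,
}

/// Hypothetical badge strings.
fn badge(probe: Probe) -> &'static str {
    match probe {
        Probe::Operational => "operational",
        Probe::Down => "down",
        Probe::Unknown => "unknown",
    }
}

/// Merge the sources of truth into plain view data.
pub fn fold(input: &FoldInput) -> FoldOutput {
    let groups: Vec<GroupViewData> = input
        .rows
        .iter()
        .map(|(id, ports)| {
            let ports: Vec<PortViewData> = ports
                .iter()
                // Optimistic delete: hide ports pending removal.
                .filter(|p| !input.deleted_ports.contains(&(id.clone(), **p)))
                .map(|p| {
                    let probe = input
                        .probes
                        .get(&(id.clone(), *p))
                        .copied()
                        .unwrap_or(Probe::Unknown);
                    PortViewData { port: *p, badge: badge(probe) }
                })
                .collect();
            GroupViewData {
                id: id.clone(),
                hosting: input.hosting.contains(id),
                ports,
            }
        })
        .collect();
    FoldOutput { groups }
}
```

Since the function only reads its input and returns owned data, table-driven tests can build a `FoldInput` literal per case and compare the whole `FoldOutput` with `assert_eq!`.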
Bundles the host/probe reliability work on this branch with the view-layer refactor.
Highlights
Pure view-fold module (#42)
Extracts the view reconciliation out of `rebuild_rows` into a pure `src/view.rs` with one entry point, `fold(&FoldInput) -> FoldOutput`, free of any Slint / channel / `Rc<RefCell>` dependency. `rebuild_rows` becomes a thin adapter. 14 new table-driven tests cover probe badge mapping, optimistic-delete hiding, placeholder folding, the hosting pill, and detail-panel reconciliation. Zero behavior change.
Host / probe reliability (#35, #37, #38, #39, #44, #45, #47)
Verification
- `cargo build` and `cargo build --features hosting`: both pass.
- `cargo test`: 93 (default) / 102 (hosting) pass.
- `cargo clippy` + `cargo fmt --check`: clean on both feature sets.

🤖 Generated with Claude Code