Host/probe reliability fixes + pure view-fold refactor#48

Merged
paulocorcino merged 14 commits into main from feat/fixes-202606
Jun 19, 2026

@paulocorcino

Owner

Bundles the host/probe reliability work on this branch with the view-layer refactor.

Highlights

Pure view-fold module (#42)

Extracts the view reconciliation out of rebuild_rows into a pure src/view.rs with one entry point, fold(&FoldInput) -> FoldOutput, free of any Slint / channel / Rc<RefCell> dependency. rebuild_rows becomes a thin adapter. 14 new table-driven tests cover probe badge mapping, optimistic-delete hiding, placeholder folding, the hosting pill, and detail-panel reconciliation. Zero behavior change.

Host / probe reliability (#35, #37, #38, #39, #44, #45, #47)

Verification

  • cargo build and cargo build --features hosting: both pass.
  • cargo test: 93 (default) / 102 (hosting) pass.
  • cargo clippy + cargo fmt --check: clean on both feature sets.
  • e2e: GUI loaded a live tunnel through the new fold path and rendered without panic.

🤖 Generated with Claude Code

paulocorcino and others added 14 commits June 18, 2026 04:28
Extract the host engine's keep-alive policy (reconnect backoff, token
re-mint timing, auth-error relogin path) into a pure, dependency-free
state machine in src/host/keepalive.rs. Declared unconditionally so its
tests run without the vendored-OpenSSL toolchain.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Cover backoff progression (2,4,8,16,32,60,60) and reset-on-success,
re-mint scheduling, auth-error to relogin, and reconnect phase change.
Verified RED against a todo!() next() then GREEN once implemented.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
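
The backoff progression and reset-on-success covered by these tests can be sketched as a tiny pure counter. This is a minimal illustration, not the code in src/host/keepalive.rs: the `Backoff` type, its method names, and the two constants are hypothetical; only the 2,4,8,16,32,60,60 sequence comes from the commit message.

```rust
use std::time::Duration;

/// Hypothetical base and cap; only the resulting sequence is from the commit message.
const BASE_SECS: u64 = 2;
const MAX_SECS: u64 = 60;

/// Pure backoff tracker: no timers and no I/O, only arithmetic on a counter.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct Backoff {
    failures: u32,
}

impl Backoff {
    /// Record a failed attempt and return how long to wait before the next one.
    pub fn on_failure(&mut self) -> Duration {
        // Doubles from 2 s and saturates at 60 s. The shift is clamped so it cannot overflow.
        let shift = self.failures.min(16);
        let secs = (BASE_SECS << shift).min(MAX_SECS);
        self.failures = self.failures.saturating_add(1);
        Duration::from_secs(secs)
    }

    /// A successful connect resets the progression back to the base delay.
    pub fn on_success(&mut self) {
        self.failures = 0;
    }
}
```

Because the tracker holds no clock, a table-driven test can step it through seven failures and a success without sleeping.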
Rewrite host_group as a thin driver around KeepAliveState: it maps the
Phase to Connecting/Reconnecting, feeds connection outcomes as ConnEvents,
and executes the returned Action. All policy constants and backoff
arithmetic are removed from engine.rs. The _host lifetime invariant
(must stay bound across the keep-alive select! to avoid the busy-loop)
is preserved and documented inline.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pre-existing formatting deviations normalized by `cargo fmt` so the
`cargo fmt --check` gate stays green. No behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…nect errors

The host engine forwarded every port as `http`, ignoring the configured
protocol. A port created as `https`/`auto` was rejected by the service with
`400 "the tunnel port protocol cannot be changed"`, and the keep-alive loop
retried forever (re-minting tokens every cycle), never reaching `Hosting`.
Only `http` ports could be hosted.

- connect_once: register each port with its configured protocol (fallback
  `auto` when absent); collect_ports now carries `(port, protocol)`, threaded
  through spawn_group -> host_group -> connect_once.
- Harden against non-recoverable failures: classify connect errors as
  Auth / Fatal / Transient (devtunnel::is_fatal_connect_error) in the pure
  keep-alive state machine (new ConnFailure enum + Action::Fail); a fatal
  error now surfaces HostState::Error and stops instead of an endless
  backoff loop. Completes the #35 keep-alive driver this builds on.

Verified end-to-end against the live service: http/https/auto all reach
Hosting (https needs a TLS backend to serve); no regression on the http
happy path or resilience. cargo test (75, incl. new fatal-path test), fmt,
and clippy (default + --features hosting) clean.

Closes #36

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
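
A rough sketch of the Auth / Fatal / Transient split described above. `ConnFailure` and `Action::Fail` are named in the commit; everything else is an assumption for illustration: the `classify` function, the substring matching, the "unauthorized" marker, and the `Relogin` / `RetryWithBackoff` variants. The real `devtunnel::is_fatal_connect_error` may inspect errors quite differently.

```rust
/// Failure classes named in the commit message.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnFailure {
    Auth,
    Fatal,
    Transient,
}

/// Only `Fail` is named in the commit; the other variants are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Relogin,
    Fail,
    RetryWithBackoff,
}

/// Hypothetical classifier working on the error text (an assumption, not the real API).
pub fn classify(message: &str) -> ConnFailure {
    let m = message.to_ascii_lowercase();
    if m.contains("unauthorized") {
        ConnFailure::Auth
    } else if m.contains("protocol cannot be changed") {
        // The one non-recoverable case quoted in the commit message.
        ConnFailure::Fatal
    } else {
        // Anything unrecognized is treated as retryable.
        ConnFailure::Transient
    }
}

/// Pure policy step: a fatal failure stops instead of entering the backoff loop.
pub fn next_action(failure: ConnFailure) -> Action {
    match failure {
        ConnFailure::Auth => Action::Relogin,
        ConnFailure::Fatal => Action::Fail,
        ConnFailure::Transient => Action::RetryWithBackoff,
    }
}
```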
The tray GUI can't be scripted, but its hosting engine is the product's
value. Add a headless entrypoint (DEVTUNNEL_HEADLESS_HOST=<id,...>) that
drives the production path (host::spawn -> engine::host_group -> keep-alive
state machine) and streams every HostEvent as JSON on stdout, returning
before any UI is built. Real engine only under --features hosting.

tests/e2e/ is a Python blackbox suite that uses the product as a user would:
creates groups on a shared local port, hosts them through the headless
engine, serves a real backend, and runs resilience scenarios while sampling
the host process:
  - S2 multiple groups, same port
  - S3 sustained load + latency + idle/loaded host CPU & RSS (busy-loop watch)
  - S1 reconnect after drop (stop->rehost proxy; real relay drop when elevated)
  - S4 auto-resume after process kill
Emits report.md/json (gitignored) with a thresholded findings section.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…utor (#38)

connect_once minted the two tokens sequentially with blocking subprocess
calls on the group's current-thread runtime. That doubled the mint wait and,
during a periodic re-mint, stalled the still-live relay + port-forward tasks
sharing the executor -- widening the very outage the re-mint exists to avoid.

Mint each token on its own spawn_blocking thread and overlap them with
try_join!, so the round-trips run in parallel and the old connection keeps
forwarding while new tokens mint. Cuts initial connect time and shrinks the
re-mint blip without overlapping two live relay connections (which would need
live validation of two-simultaneous-hosts behavior -- left as follow-up).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
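
The overlap can be illustrated without an async runtime. The sketch below swaps in plain `std::thread` for the `spawn_blocking` + `try_join!` pair the commit uses, so it shows the same idea (two blocking round-trips running side by side) but not the actual code. `mint_blocking` is a hypothetical stand-in for the subprocess call, and the `Tokens` field names are assumed.

```rust
use std::thread;
use std::time::Duration;

/// Assumed shape of the minted pair.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Tokens {
    pub host: String,
    pub manage: String,
}

/// Hypothetical stand-in for one blocking token subprocess round-trip.
fn mint_blocking(scope: &'static str, delay: Duration) -> Result<String, String> {
    thread::sleep(delay);
    Ok(format!("{scope}-token"))
}

/// Mint both tokens concurrently: total wait is roughly one round-trip, not two.
pub fn mint_tokens_parallel(delay: Duration) -> Result<Tokens, String> {
    let host = thread::spawn(move || mint_blocking("host", delay));
    let manage = thread::spawn(move || mint_blocking("manage", delay));

    // The outer error is a panicked thread, the inner one a failed mint.
    let host = host.join().map_err(|_| "host mint panicked".to_string())??;
    let manage = manage.join().map_err(|_| "manage mint panicked".to_string())??;
    Ok(Tokens { host, manage })
}
```

One difference worth noting: `try_join!` returns as soon as either future fails, while this thread version always waits for the first handle before looking at the second.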
collect_ports called fetch_rows, which enumerates the whole account: a
`devtunnel list` plus a `devtunnel show` for *every* tunnel, then discards all
but the one being hosted. Hosting one tunnel therefore cost 1 + N subprocess
round-trips (N = total tunnels), run serially before the relay handshake --
and the live E2E showed this, not the handshake, dominated the ~14-18s
connect/resume time.

Replace it with a targeted `fetch_tunnel_ports`: one `devtunnel show <id> -j`
for just the hosted tunnel, mapped to (port, protocol) by a pure, unit-tested
helper (protocol preserved per #36). Account size no longer affects connect
time.

Measured on the blackbox E2E (live brs cluster): connect to Hosting ~14-18s ->
~2-5s, stop->rehost ~16.5s -> ~4.4s, cold recover ~16s -> ~1.4-4.9s; serving
True, error rate 0, host CPU/RSS unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
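
The pure mapping step might look roughly like the following. The `RawPort` struct and the `map_ports` name are hypothetical, and the JSON parsing of the `devtunnel show` output is left out; the `auto` fallback for a missing protocol is the behavior stated in the #36 fix.

```rust
/// Hypothetical pre-parsed port record; field names are assumptions.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RawPort {
    pub port_number: u16,
    pub protocol: Option<String>,
}

/// Map parsed records to (port, protocol), falling back to "auto" when the
/// protocol is absent or blank.
pub fn map_ports(raw: &[RawPort]) -> Vec<(u16, String)> {
    raw.iter()
        .map(|p| {
            let protocol = p
                .protocol
                .as_deref()
                .map(str::trim)
                .filter(|s| !s.is_empty())
                .unwrap_or("auto");
            (p.port_number, protocol.to_string())
        })
        .collect()
}
```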
…ss (#45)

A connect spends most of its time in three phases -- minting tokens,
the relay handshake, and forwarding ports -- but reported only one static
"Connecting" label, making a multi-second wait indistinguishable from a hang.

Add an additive HostEvent::Progress { phase } emitted by connect_once at each
phase boundary. The coarse Connecting/Hosting state transitions are unchanged,
so the headless JSON contract the E2E depends on is preserved (the new
"progress" line is additive). The GUI maps each phase to a Fluent status-bar
string (status-connect-*); the headless runner serializes it as an additive
"progress" event.

Verified live: the stream now interleaves
Connecting -> progress(authorizing) -> progress(connecting_relay) ->
progress(forwarding_ports) -> Hosting, which also shows token minting (~1.9s)
is now the dominant connect cost after the #44 port-fetch fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
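
A sketch of how an additive progress event could sit next to the existing coarse states. The phase names match the live stream quoted above; the `ConnectPhase` type name, the reduced `HostEvent` variant set, and the exact JSON shape are assumptions, since the commit does not spell out the wire format.

```rust
/// Phases named in the verified stream; the enum name is hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnectPhase {
    Authorizing,
    ConnectingRelay,
    ForwardingPorts,
}

impl ConnectPhase {
    pub fn as_str(self) -> &'static str {
        match self {
            ConnectPhase::Authorizing => "authorizing",
            ConnectPhase::ConnectingRelay => "connecting_relay",
            ConnectPhase::ForwardingPorts => "forwarding_ports",
        }
    }
}

/// Reduced event set for illustration; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HostEvent {
    Connecting,
    Progress { phase: ConnectPhase },
    Hosting,
}

/// Serialize one event as a JSON line (assumed shape). Existing lines are
/// unchanged; "progress" is a new line a consumer may ignore.
pub fn to_json_line(event: &HostEvent) -> String {
    match event {
        HostEvent::Connecting => r#"{"event":"connecting"}"#.to_string(),
        HostEvent::Progress { phase } => {
            format!(r#"{{"event":"progress","phase":"{}"}}"#, phase.as_str())
        }
        HostEvent::Hosting => r#"{"event":"hosting"}"#.to_string(),
    }
}
```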
…minting (#47)

After #44, token minting (~1.9s for two `devtunnel token` subprocess round-trips)
is the dominant connect cost -- and connect_once re-minted on every attempt,
including relay-drop reconnects where the previous tokens are still valid
(~24h lifetime; the engine already re-mints proactively at 20h).

Cache the minted (host, manage) pair driver-side in host_group and reuse it:
- relay-drop reconnect -> reuse cached tokens (skip the mint and the
  `Authorizing` phase);
- RemintDue (~20h) -> clear the cache and mint fresh before expiry;
- connect failure -> cache already taken and not restored, so the next attempt
  re-mints (no stale-token reuse loop).

No expiry parsing needed: the 20h re-mint timer bounds reuse well inside the
~24h validity. mint_tokens is split out of connect_once, which now takes an
Option<Tokens> and returns the tokens used so the caller can cache them. The
_host busy-loop invariant is unchanged.

Live: first connect still mints + serves (Connecting -> authorizing ->
connecting_relay -> forwarding_ports -> Hosting). The in-session relay-drop
reuse path needs an elevated firewall block to force (same S1b limitation the
E2E documents); reviewed by inspection. Gates: cargo test (76), clippy default
+ --features hosting, fmt --check -- all green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
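
The cache rules above reduce to a take-then-restore pattern on an `Option`. The sketch below is a simplified model, not the host_group code: `TokenCache`, `attempt`, `remint_due`, and the `mints` counter are hypothetical names, and the connect outcome is passed in as a flag instead of coming from a real relay.

```rust
/// Assumed shape of the cached pair.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Tokens {
    pub host: String,
    pub manage: String,
}

/// Hypothetical driver-side cache; `mints` counts how often a fresh mint ran.
#[derive(Debug, Default)]
pub struct TokenCache {
    cached: Option<Tokens>,
    pub mints: u32,
}

impl TokenCache {
    /// Run one connect attempt. Returns true if this attempt had to mint.
    pub fn attempt(&mut self, connect_ok: bool) -> bool {
        // take() empties the cache up front, so a failed attempt cannot leave
        // the same tokens behind for the next try.
        let (tokens, minted) = match self.cached.take() {
            Some(tokens) => (tokens, false),
            None => {
                self.mints += 1;
                let tokens = Tokens {
                    host: format!("host-{}", self.mints),
                    manage: format!("manage-{}", self.mints),
                };
                (tokens, true)
            }
        };
        // Only a successful connect puts the tokens back for later reuse.
        if connect_ok {
            self.cached = Some(tokens);
        }
        minted
    }

    /// The periodic re-mint timer fired: drop the cache so the next attempt mints.
    pub fn remint_due(&mut self) {
        self.cached = None;
    }
}
```

Restoring only on success is what rules out a stale-token reuse loop: after any failed attempt the cache is empty, so the retry always mints.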
…euse (#47)

A genuine in-session relay drop (the path that exercises #47's token reuse)
could only be forced with an elevated firewall block, which is slow and flaky:
an outbound block does not sever the established relay socket until a long
keepalive timeout, and a held block makes the reconnect attempts fail (which by
design clears the cache and re-mints), so it never cleanly demonstrates reuse.

Add a HostCommand::DropRelay that signals a per-group Notify raced in the
keep-alive select!, producing a RelayDropped without tearing the group down.
The headless runner exposes it as a `drop <id>` stdin command. This forces a
deterministic reconnect with no network outage, firewall, or admin.

Verified reuse with it (non-elevated): after `drop`, the reconnect goes
straight to connecting_relay with NO `authorizing` phase and reaches Hosting in
~0.5s (vs ~2.4s on first connect) -- the relay accepts the reused token and the
~1.9s mint is skipped. Closes the open verification item on #47.

Gates: cargo test (76), clippy default + --features hosting, fmt --check.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…chine (#39)

Bridge the public-URL health probe into the keep-alive policy so a zombie
tunnel (relay session the SDK still believes live, but whose public URL is
dead) forces a reconnect instead of hanging in `Hosting` forever.

This commit lands the *pure, unit-tested* half of #39: the policy. The driver
wiring (feeding probe ticks into the engine's keep-alive `select!`) stays out,
gated on the #37 zombie-tunnel go-decision — `Action::Reconnect` is therefore
never emitted by `engine.rs` yet, only handled.

- `ProbeOutcome { Healthy, Down, ServiceDown }` and `ConnEvent::Probe(_)`: the
  streak is counted inside the state machine so the false-positive guard is pure
  and testable. Only a `Down` streak reaching `PROBE_DOWN_THRESHOLD` (3) on a
  live `Hosting` session yields `Action::Reconnect`; `ServiceDown` (relay alive,
  local upstream down — e.g. a server restart) never triggers, per the #39
  acceptance criterion.
- Probes before the first connect, or after a session-ending event, are absorbed
  as `Await` — the watchdog only arms between `Connected` and the next teardown.
- `Reconnect` reconnects immediately with no extra backoff, funnelling into the
  existing `connect_once` path (no parallel reconnect logic).

8 new state-machine tests cover the streak threshold, the ServiceDown guard,
streak resets, the not-connected windows, and re-arming after a reconnect.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
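
A condensed model of the streak policy described above. `ProbeOutcome`, `ConnEvent::Probe`, `PROBE_DOWN_THRESHOLD`, `Action::Reconnect`, and `Await` are named in the commit; the `Watchdog` struct, the reduced event set, and the choice to reset the streak on `ServiceDown` are assumptions made for this sketch. The real state machine also handles backoff, re-mint, and auth events.

```rust
pub const PROBE_DOWN_THRESHOLD: u32 = 3;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ProbeOutcome {
    Healthy,
    Down,
    ServiceDown,
}

/// Reduced event set for illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ConnEvent {
    Connected,
    RelayDropped,
    Probe(ProbeOutcome),
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Action {
    Await,
    Reconnect,
}

/// Hypothetical watchdog slice of the keep-alive state machine.
#[derive(Debug, Default)]
pub struct Watchdog {
    hosting: bool,
    down_streak: u32,
}

impl Watchdog {
    pub fn next(&mut self, event: ConnEvent) -> Action {
        match event {
            ConnEvent::Connected => {
                self.hosting = true;
                self.down_streak = 0;
                Action::Await
            }
            ConnEvent::RelayDropped => {
                self.hosting = false;
                self.down_streak = 0;
                Action::Await
            }
            // Not armed: probes outside a live session are absorbed.
            ConnEvent::Probe(_) if !self.hosting => Action::Await,
            ConnEvent::Probe(ProbeOutcome::Down) => {
                self.down_streak += 1;
                if self.down_streak >= PROBE_DOWN_THRESHOLD {
                    // Disarm until the next Connected event.
                    self.hosting = false;
                    self.down_streak = 0;
                    Action::Reconnect
                } else {
                    Action::Await
                }
            }
            // Healthy resets the streak; ServiceDown never triggers (reset assumed).
            ConnEvent::Probe(ProbeOutcome::Healthy)
            | ConnEvent::Probe(ProbeOutcome::ServiceDown) => {
                self.down_streak = 0;
                Action::Await
            }
        }
    }
}
```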
)

The probe could not see a zombie tunnel. `combine` deliberately reports a
Public-URL network error as `Operational` while the local port is listening
(a transient WAN hiccup is not a service outage), so the exact zombie state —
local upstream fine, Public URL dead, the SDK's `RelayHandle` never resolving
so the engine stays `Hosting` — was invisible to every layer. The probe's
`Down` is only ever set by the engine's `RelayHandle`, which in a zombie never
fires. The signal #39's watchdog needs did not exist yet.

Surface it without changing the badge: when the slow HTTP fallback finds the
Public URL unreachable while the local port is up, the probe emits a new
`ProbeEvent::PublicUnreachable`. The wiring layer logs it at WARN only when the
engine still believes that group is `Hosting` (the full zombie signature),
and at DEBUG otherwise (an ordinary drop the engine is already reconnecting).

This is the lightweight instrumentation of #37: pure observability, no
behaviour change. The recorded occurrences over real-use hosting feed the #37
go/no-go decision and, once that gate opens, the #39 reconnect bridge (whose
pure policy already landed in keepalive.rs).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
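
The log-level rule for `ProbeEvent::PublicUnreachable` is small enough to state as a pure function. This is an illustrative sketch only: the `LogLevel` enum and the function name are hypothetical, and the `HostState` variants are limited to the ones mentioned in this PR.

```rust
/// Engine states mentioned in this PR; the real enum may carry more data.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HostState {
    Connecting,
    Reconnecting,
    Hosting,
    Error,
}

/// Hypothetical stand-in for the logging crate's level type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum LogLevel {
    Warn,
    Debug,
}

/// Pick the level for a public-unreachable observation on one group.
pub fn public_unreachable_level(engine_state: HostState) -> LogLevel {
    match engine_state {
        // Full zombie signature: engine believes it is hosting, public URL is dead.
        HostState::Hosting => LogLevel::Warn,
        // Any other state: the engine already knows the session is not live.
        _ => LogLevel::Debug,
    }
}
```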
Move the view reconciliation logic out of `rebuild_rows` into a new pure
`src/view.rs` with one entry point, `fold(&FoldInput) -> FoldOutput`. The
four sources of truth (CLI rows, probe results, host state, optimistic
delete/placeholder sets) are now merged in a module free of any Slint,
channel, or `Rc<RefCell>` dependency, returning plain `GroupViewData` /
`PortViewData`. `rebuild_rows` becomes a thin adapter: feed inputs, map the
plain result onto Slint structs, rebuild the tray menu, set props.

`derive_status`, `derive_host_state`, the `Placeholder` struct, and
`PROVISIONING_STATUS` move into the module. Adds 14 table-driven tests
covering badge mapping for the 3 probe states, optimistic-delete hiding
(single port / whole group / last-port-portless), placeholder folding, the
hosting pill, and detail-panel reconciliation. Zero behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
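
To give a feel for the shape of the fold, here is a much reduced sketch covering only optimistic-delete hiding and the hosting flag. The type names `FoldInput`, `FoldOutput`, `GroupViewData`, and `PortViewData` come from the commit; every field, the `CliRow` struct, and the grouping logic are assumptions, and probe badges, placeholders, and the detail panel are omitted.

```rust
use std::collections::HashSet;

/// Hypothetical CLI row: one port belonging to one group.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CliRow {
    pub group: String,
    pub port: u16,
}

/// Plain inputs only: no UI handles, channels, or shared mutable state.
#[derive(Debug, Default)]
pub struct FoldInput {
    pub rows: Vec<CliRow>,
    pub deleted_groups: HashSet<String>,
    pub deleted_ports: HashSet<(String, u16)>,
    pub hosting: HashSet<String>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PortViewData {
    pub port: u16,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct GroupViewData {
    pub id: String,
    pub hosting: bool,
    pub ports: Vec<PortViewData>,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FoldOutput {
    pub groups: Vec<GroupViewData>,
}

/// Merge the inputs into plain view data, keeping groups in first-seen order.
pub fn fold(input: &FoldInput) -> FoldOutput {
    let mut groups: Vec<GroupViewData> = Vec::new();
    for row in &input.rows {
        // A group pending deletion is hidden entirely.
        if input.deleted_groups.contains(&row.group) {
            continue;
        }
        let idx = match groups.iter().position(|g| g.id == row.group) {
            Some(i) => i,
            None => {
                groups.push(GroupViewData {
                    id: row.group.clone(),
                    hosting: input.hosting.contains(&row.group),
                    ports: Vec::new(),
                });
                groups.len() - 1
            }
        };
        // A port pending deletion is hidden; its group stays, possibly portless.
        if !input.deleted_ports.contains(&(row.group.clone(), row.port)) {
            groups[idx].ports.push(PortViewData { port: row.port });
        }
    }
    FoldOutput { groups }
}
```

With inputs and outputs as plain data, each scenario in a table-driven test is just a `FoldInput` literal and an expected `FoldOutput`.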
@paulocorcino paulocorcino merged commit 9a447bb into main Jun 19, 2026
2 checks passed