feat(checkout): harden slow-retry tier 2 against hangs and false positives#539
feat(checkout): harden slow-retry tier 2 against hangs and false positives#539skipi wants to merge 1 commit into
Conversation
…tives Follow-up to #538. Three fixes to the opt-in SEMAPHORE_GIT_CLONE_SLOW_RETRY path: - DoH alt-IP resolution now passes curl --connect-timeout/--max-time (5s/10s, overridable). Without them, tier 2 hangs indefinitely when dns.google is blocked or unreachable — common on locked-down self-hosted runners. - Speed check gains a startup grace window (SEMAPHORE_GIT_CLONE_SLOW_GRACE, default 30s) before slow-detection arms. Server-side object counting/compression and connection setup can run for seconds with little on-disk growth; penalizing it killed healthy clones of large repos. - SSH alt-IP path checks for nc up front and fails cleanly with a clear message instead of producing a cryptic git/ssh error when nc is absent. Adds bats coverage for the grace window, curl timeout flags, and the missing-nc path. Full suite (22 tests) verified on a Semaphore ubuntu2204 f1-standard-2 testbox. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| elapsed=$((elapsed + check_interval)) | ||
|
|
||
| local cur_size | ||
| cur_size=$(du -sk "${target_dir}" 2>/dev/null | awk '{print $1}') |
There was a problem hiding this comment.
@skipi Disk-size delta is not a reliable network-throughput signal. Healthy clone phases can have flat/negative growth and be killed as "slow". Grace only delays arming; it does not fix the signal. Please replace this with transport-level checks (e.g. git/curl low-speed controls for HTTPS) plus an absolute timeout.
| continue | ||
| fi | ||
|
|
||
| if [ "$cur_size" -gt 0 ]; then |
There was a problem hiding this comment.
@skipi Slow accounting still requires cur_size > 0, so pre-write stalls (DNS/TCP/TLS/auth) are never detected by this watchdog. Please add a no-progress timeout and a hard wall-clock timeout independent of directory size.
| slow_seconds=$((slow_seconds + check_interval)) | ||
| if [ "$slow_seconds" -ge "$timeout" ]; then | ||
| echo "[checkout] Slow clone detected (${speed} bytes/sec for ${slow_seconds}s)" | ||
| kill "$pid" 2>/dev/null |
There was a problem hiding this comment.
@skipi kill "$pid" only targets the parent. git children can survive and race with rm -rf on retry (lines 311/366). Please run clone in its own process group and terminate the group before cleanup.
| } | ||
|
|
||
| function checkout::resilient_clone() { | ||
| local retry_count="${SEMAPHORE_GIT_CLONE_RETRY_COUNT:-2}" |
There was a problem hiding this comment.
@skipi Default retry count is 2 here; legacy retry path effectively attempted 3.
| fi | ||
| echo "Performing shallow clone with depth: $SEMAPHORE_GIT_DEPTH" | ||
|
|
||
| if ! checkout::git_clone --depth "${SEMAPHORE_GIT_DEPTH}" -b "${SEMAPHORE_GIT_BRANCH}" "${SEMAPHORE_GIT_URL}" "${SEMAPHORE_GIT_DIR}"; then |
There was a problem hiding this comment.
@skipi This fallback treats any clone failure as "branch not found" and escalates to full clone. With resilient mode, slow/network failures hit this too. Please return distinct failure classes from checkout::git_clone and only full-clone on genuine ref/branch errors.
| fi | ||
|
|
||
| # GeoDNS-based alt-IP fallback is GitHub-specific; skip for other providers | ||
| if [ "$git_host" != "github.com" ]; then |
There was a problem hiding this comment.
@skipi Tier-2 host gate only allows github.com. This excludes valid GitHub SSH-over-443 endpoint ssh.github.com. Please allowlist both github.com and ssh.github.com after normalizing host (userinfo/port stripped).
…feature flag (#1046) ## What Adds a per-organization feature flag, `git_clone_slow_retry`, that — when enabled — injects `SEMAPHORE_GIT_CLONE_SLOW_RETRY=true` into the job environment built by `JobRequestFactory`. This is the **producer** side of the toolbox checkout-resiliency work (semaphoreci/toolbox#538 + follow-up semaphoreci/toolbox#539). The toolbox `checkout` reads `SEMAPHORE_GIT_CLONE_SLOW_RETRY` to opt into slow-clone detection + resilient retry (speed monitoring, retries, alternative-endpoint fallback), and is a no-op when the var is absent. So the rollout is: flip the flag for an org → zebra injects the var → toolbox enables the behavior. Default off = unchanged behavior. ## How Mirrors the existing `TestResults` / `:test_results_no_trim` pattern exactly: - New `Zebra.Workers.JobRequestFactory.GitCheckout` module with `env_vars/1`, gated on `FeatureProvider.feature_enabled?(:git_clone_slow_retry, param: org_id)`. - Appended to the env-var list in `JobRequestFactory` alongside `TestResults.env_vars(org_id)`. - Registered in the test `StubbedProvider` (hidden by default, enabled for a dedicated test org id). Deliberately *not* threading `org_id` through `Repository.*` — that would churn the exact-match repository tests for no benefit. Only the on/off switch is injected; the toolbox owns the tuning defaults (threshold / timeout / grace / retries). ## Tests `zebra/test/zebra/workers/job_request_factory/git_checkout_test.exs` — feature disabled → `[]`; enabled → `SEMAPHORE_GIT_CLONE_SLOW_RETRY=true`. ## Follow-up (not in this PR) The `git_clone_slow_retry` feature must be registered in the feature management backend before it can be toggled per-org in production. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Follow-up to #538, which added the opt-in
SEMAPHORE_GIT_CLONE_SLOW_RETRYcheckout resilience. Three issues raised in that review were still live onmaster; this addresses them.Changes
1. DoH alt-IP resolution can hang (tier 2)
checkout::resolve_alt_ipscalledcurlagainstdns.googlewith no timeout. On locked-down self-hosted runners that block outbound DoH, the tier-2 fallback would hang indefinitely (once per region). Now passes--connect-timeout/--max-time(5s / 10s, overridable viaSEMAPHORE_GIT_CLONE_DOH_CONNECT_TIMEOUT/SEMAPHORE_GIT_CLONE_DOH_MAX_TIME).2. Speed check false-positives on large repos
checkout::clone_with_speed_checkarmed slow-detection immediately. Server-side object counting/compression and connection setup can run for several seconds with little on-disk growth before the receive phase ramps up, so a healthy clone of a large repo could be killed as "slow". Added a startup grace window (SEMAPHORE_GIT_CLONE_SLOW_GRACE, default 30s) before detection arms.3. SSH alt-IP path errors cryptically when
ncis absentThe SSH tier-2 fallback routes through
ProxyCommand='nc ...'but didn't checkncexists. Now fails fast with a clear message and lets the caller move on, instead of surfacing an opaque git/ssh error.All three are scoped to the opt-in path; default behavior (flag off) is unchanged.
Tests
Added bats coverage for the grace window, the curl timeout flags, and the missing-
ncpath. Full suite (22 tests) run on a Semaphoreubuntu2204f1-standard-2testbox — all green, including the three real-clone full-checkout flows (push / PR / tag):🤖 Generated with Claude Code