Skip to content

feat(checkout): harden slow-retry tier 2 against hangs and false positives#539

Open
skipi wants to merge 1 commit into
masterfrom
skipi/github-resiliency-followups
Open

feat(checkout): harden slow-retry tier 2 against hangs and false positives#539
skipi wants to merge 1 commit into
masterfrom
skipi/github-resiliency-followups

Conversation

@skipi
Copy link
Copy Markdown
Member

@skipi skipi commented May 29, 2026

Follow-up to #538, which added the opt-in SEMAPHORE_GIT_CLONE_SLOW_RETRY checkout resilience. Three issues raised in that review were still live on master; this addresses them.

Changes

1. DoH alt-IP resolution can hang (tier 2)
checkout::resolve_alt_ips called curl against dns.google with no timeout. On locked-down self-hosted runners that block outbound DoH, the tier-2 fallback would hang indefinitely (once per region). Now passes --connect-timeout / --max-time (5s / 10s, overridable via SEMAPHORE_GIT_CLONE_DOH_CONNECT_TIMEOUT / SEMAPHORE_GIT_CLONE_DOH_MAX_TIME).

2. Speed check false-positives on large repos
checkout::clone_with_speed_check armed slow-detection immediately. Server-side object counting/compression and connection setup can run for several seconds with little on-disk growth before the receive phase ramps up, so a healthy clone of a large repo could be killed as "slow". Added a startup grace window (SEMAPHORE_GIT_CLONE_SLOW_GRACE, default 30s) before detection arms.

3. SSH alt-IP path errors cryptically when nc is absent
The SSH tier-2 fallback routes through ProxyCommand='nc ...' but didn't check nc exists. Now fails fast with a clear message and lets the caller move on, instead of surfacing an opaque git/ssh error.

All three are scoped to the opt-in path; default behavior (flag off) is unchanged.

Tests

Added bats coverage for the grace window, the curl timeout flags, and the missing-nc path. Full suite (22 tests) run on a Semaphore ubuntu2204 f1-standard-2 testbox — all green, including the three real-clone full-checkout flows (push / PR / tag):

1..22
ok 1 slow retry - disabled by default, normal clone works
...
ok 6 slow retry - speed check grace window protects slow start
ok 7 slow retry - resolve_alt_ips passes curl connect/max timeouts
...
ok 19 slow retry - clone_with_alt_ip fails cleanly for SSH when nc missing
ok 20 slow retry - full checkout with flag on succeeds for push
ok 21 slow retry - full checkout with flag on succeeds for PR
ok 22 slow retry - full checkout with flag on succeeds for tag

🤖 Generated with Claude Code

…tives

Follow-up to #538. Three fixes to the opt-in
SEMAPHORE_GIT_CLONE_SLOW_RETRY path:

- DoH alt-IP resolution now passes curl --connect-timeout/--max-time
  (5s/10s, overridable). Without them, tier 2 hangs indefinitely when
  dns.google is blocked or unreachable — common on locked-down
  self-hosted runners.
- Speed check gains a startup grace window (SEMAPHORE_GIT_CLONE_SLOW_GRACE,
  default 30s) before slow-detection arms. Server-side object
  counting/compression and connection setup can run for seconds with
  little on-disk growth; penalizing it killed healthy clones of large
  repos.
- SSH alt-IP path checks for nc up front and fails cleanly with a clear
  message instead of producing a cryptic git/ssh error when nc is absent.

Adds bats coverage for the grace window, curl timeout flags, and the
missing-nc path. Full suite (22 tests) verified on a Semaphore
ubuntu2204 f1-standard-2 testbox.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread libcheckout
elapsed=$((elapsed + check_interval))

local cur_size
cur_size=$(du -sk "${target_dir}" 2>/dev/null | awk '{print $1}')
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skipi Disk-size delta is not a reliable network-throughput signal. Healthy clone phases can have flat/negative growth and be killed as "slow". Grace only delays arming; it does not fix the signal. Please replace this with transport-level checks (e.g. git/curl low-speed controls for HTTPS) plus an absolute timeout.

Comment thread libcheckout
continue
fi

if [ "$cur_size" -gt 0 ]; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skipi Slow accounting still requires cur_size > 0, so pre-write stalls (DNS/TCP/TLS/auth) are never detected by this watchdog. Please add a no-progress timeout and a hard wall-clock timeout independent of directory size.

Comment thread libcheckout
slow_seconds=$((slow_seconds + check_interval))
if [ "$slow_seconds" -ge "$timeout" ]; then
echo "[checkout] Slow clone detected (${speed} bytes/sec for ${slow_seconds}s)"
kill "$pid" 2>/dev/null
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skipi kill "$pid" only targets the parent. git children can survive and race with rm -rf on retry (lines 311/366). Please run clone in its own process group and terminate the group before cleanup.

Comment thread libcheckout
}

function checkout::resilient_clone() {
local retry_count="${SEMAPHORE_GIT_CLONE_RETRY_COUNT:-2}"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skipi Default retry count is 2 here; legacy retry path effectively attempted 3.

Comment thread libcheckout
fi
echo "Performing shallow clone with depth: $SEMAPHORE_GIT_DEPTH"

if ! checkout::git_clone --depth "${SEMAPHORE_GIT_DEPTH}" -b "${SEMAPHORE_GIT_BRANCH}" "${SEMAPHORE_GIT_URL}" "${SEMAPHORE_GIT_DIR}"; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skipi This fallback treats any clone failure as "branch not found" and escalates to full clone. With resilient mode, slow/network failures hit this too. Please return distinct failure classes from checkout::git_clone and only full-clone on genuine ref/branch errors.

Comment thread libcheckout
fi

# GeoDNS-based alt-IP fallback is GitHub-specific; skip for other providers
if [ "$git_host" != "github.com" ]; then
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@skipi Tier-2 host gate only allows github.com. This excludes valid GitHub SSH-over-443 endpoint ssh.github.com. Please allowlist both github.com and ssh.github.com after normalizing host (userinfo/port stripped).

skipi added a commit to semaphoreio/semaphore that referenced this pull request May 29, 2026
…feature flag (#1046)

## What

Adds a per-organization feature flag, `git_clone_slow_retry`, that —
when enabled — injects `SEMAPHORE_GIT_CLONE_SLOW_RETRY=true` into the
job environment built by `JobRequestFactory`.

This is the **producer** side of the toolbox checkout-resiliency work
(semaphoreci/toolbox#538 + follow-up semaphoreci/toolbox#539). The
toolbox `checkout` reads `SEMAPHORE_GIT_CLONE_SLOW_RETRY` to opt into
slow-clone detection + resilient retry (speed monitoring, retries,
alternative-endpoint fallback), and is a no-op when the var is absent.
So the rollout is: flip the flag for an org → zebra injects the var →
toolbox enables the behavior. Default off = unchanged behavior.

## How

Mirrors the existing `TestResults` / `:test_results_no_trim` pattern
exactly:

- New `Zebra.Workers.JobRequestFactory.GitCheckout` module with
`env_vars/1`, gated on
`FeatureProvider.feature_enabled?(:git_clone_slow_retry, param:
org_id)`.
- Appended to the env-var list in `JobRequestFactory` alongside
`TestResults.env_vars(org_id)`.
- Registered in the test `StubbedProvider` (hidden by default, enabled
for a dedicated test org id).

Deliberately *not* threading `org_id` through `Repository.*` — that
would churn the exact-match repository tests for no benefit. Only the
on/off switch is injected; the toolbox owns the tuning defaults
(threshold / timeout / grace / retries).

## Tests

`zebra/test/zebra/workers/job_request_factory/git_checkout_test.exs` —
feature disabled → `[]`; enabled →
`SEMAPHORE_GIT_CLONE_SLOW_RETRY=true`.

## Follow-up (not in this PR)

The `git_clone_slow_retry` feature must be registered in the feature
management backend before it can be toggled per-org in production.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants