Skip to content

fix: detect containerd socket errors and add Docker Desktop pre-flight warning for all Solo commands#4568

Open
Copilot wants to merge 19 commits into
mainfrom
copilot/fix-redeploy-failure-relation
Open

fix: detect containerd socket errors and add Docker Desktop pre-flight warning for all Solo commands#4568
Copilot wants to merge 19 commits into
mainfrom
copilot/fix-redeploy-failure-relation

Conversation

Copilot AI commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

On macOS with Docker Desktop's "Use containerd for pulling and storing images" toggle enabled, relay pods fail with ImageInspectError pointing at /run/containerd/containerd.sock. Previously the error was silently swallowed as transient, causing a 15-minute silent timeout before a generic "pod not found" message. This PR surfaces an actionable error message immediately and adds a pre-flight check that runs automatically for every Solo command via the shared dependency-check flow.

Description

This pull request changes the following:

src/integration/kube/k8-client/resources/pod/k8-client-pods.ts

  • isContainerdSocketError() — exported helper matching containerd socket error strings in waiting-state messages
  • CONTAINERD_SOCKET_FATAL_THRESHOLD = 5 — higher streak threshold (vs default 3) to tolerate Docker Desktop startup races
  • detectFatalContainerErrorImageInspectError + containerd socket message now returns an actionable string with fix instructions instead of continuing silently
  • waitForRunningPhase — picks CONTAINERD_SOCKET_FATAL_THRESHOLD when the fatal error is a containerd socket error

src/core/docker-desktop-containerd-check-result.ts (new)

  • DockerDesktopContainerdCheckResult interface extracted into its own kebab-case file per the TypeScript style guide (§ 3.5)

src/core/docker-desktop-checker.ts (new)

  • checkDockerDesktopContainerdSetting() reads Docker Desktop settings files and checks useContainerdSnapshotter
  • On macOS: probes ~/.docker/settings-store.json, ~/.docker/settings.json, and the legacy ~/Library/Group Containers/group.com.docker/settings.json
  • On Windows: probes %APPDATA%\Docker\settings-store.json and %APPDATA%\Docker\settings.json first, then the ~/.docker/ equivalents
  • No-ops on Linux; uses PathEx.join (not path.join) per the TypeScript style guide (§ 3.3.5); paths evaluated lazily so test stubs on os.homedir work correctly

src/commands/base.ts

  • dockerDesktopPreflightTask() added as a protected method on BaseCommand, making it available to every Solo command that extends BaseCommand via standard inheritance
  • The task calls the checker and emits a warn-level log with fix instructions if the containerd snapshotter setting is on (non-blocking — warns only, does not halt the command)

src/commands/init/init.ts

  • installDependenciesTasks() now includes dockerDesktopPreflightTask() as its first task — since Subcommand.installDependencies() in command-builder.ts calls this method before every Solo sub-command handler, the Docker Desktop pre-flight check runs automatically for all Solo commands without any per-command wiring

src/commands/relay.ts

  • Dedicated this.dockerDesktopPreflightTask() calls removed from add and upgrade task flows — now covered universally via installDependenciesTasks()

docs/site/content/en/system-readiness.md (new)

  • Documents the Docker Desktop containerd issue: symptoms, root cause, fix steps, and Solo's pre-flight behavior

Related Issues

Closed #4351

Pull request (PR) checklist

  • This PR added tests (unit, integration, and/or end-to-end)
  • This PR updated documentation
  • This PR added no TODOs or commented out code
  • This PR has no breaking changes
  • Any technical debt has been documented as a separate issue and linked to this PR
  • Any package.json changes have been explained to and approved by a repository manager
  • All related issues have been linked to this PR
  • All changes in this PR are included in the description
  • When this PR merges the commits will be squashed and the title will be used as the commit message, the 'commit message guidelines' below have been followed

Testing

  • This PR added unit tests
  • This PR added integration/end-to-end tests
  • These changes required manual testing that is documented below
  • Anything not tested is documented

The following manual testing was done:

  • Verified isContainerdSocketError matches the exact socket path string from the filed bug report
  • Verified all 6 checkDockerDesktopContainerdSetting unit tests and 6 detectFatalContainerError containerd-branch unit tests pass
  • Verified TypeScript compilation and ESLint pass cleanly after style guide and Windows compatibility fixes
  • Verified dockerDesktopPreflightTask() runs for all commands via installDependenciesTasks() with no regressions across 11 passing unit tests

The following was not tested:

  • End-to-end: live Docker Desktop with useContainerdSnapshotter: true toggled on — cannot be reproduced in CI; behavior verified via unit tests and code review
  • Windows: live Docker Desktop settings path resolution on a real Windows machine — correct %APPDATA%\Docker\ paths are exercised via unit tests with stubbed OS detection

@trunk-io

trunk-io Bot commented Jun 3, 2026

Copy link
Copy Markdown

✨ Submitted to Merge by @JeffreyDallas. It will be added to the merge queue once all branch protection rules pass and there are no merge conflicts with the target branch. See more details here.

Copilot AI changed the title [WIP] Fix relay deploy failure due to containerd.sock connection error fix: detect containerd socket errors in relay deploy with Docker Desktop pre-flight warning Jun 3, 2026
Copilot AI requested a review from JeffreyDallas June 3, 2026 15:10
Signed-off-by: JeffreyDallas <39912573+JeffreyDallas@users.noreply.github.com>
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Unit Test Results - Linux

38 tests  ±0   38 ✅ ±0   0s ⏱️ ±0s
17 suites ±0    0 💤 ±0 
 1 files   ±0    0 ❌ ±0 

Results for commit 07e8b03. ± Comparison against base commit ee99636.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Unit Test Results - Windows

    1 files  ± 0    339 suites  +2   11s ⏱️ -1s
1 080 tests +12  1 080 ✅ +12  0 💤 ±0  0 ❌ ±0 
1 084 runs  +12  1 084 ✅ +12  0 💤 ±0  0 ❌ ±0 

Results for commit 07e8b03. ± Comparison against base commit ee99636.

♻️ This comment has been updated with latest results.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

E2E Test Report

 10 files  ±0   94 suites  ±0   1h 21m 13s ⏱️ - 5m 10s
298 tests ±0  298 ✅ ±0  0 💤 ±0  0 ❌ ±0 
316 runs  ±0  316 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit bbffcb4. ± Comparison against base commit c3e25c5.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

E2E Test Report

 10 files  ±0   94 suites  ±0   1h 28m 23s ⏱️ +39s
302 tests ±0  302 ✅ ±0  0 💤 ±0  0 ❌ ±0 
321 runs  ±0  321 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit 07e8b03. ± Comparison against base commit ee99636.

♻️ This comment has been updated with latest results.

JeffreyDallas

This comment was marked as outdated.

This comment was marked as outdated.

@JeffreyDallas JeffreyDallas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot the pre-flight check should not just for relay node, maybe more general for all solo command, considering move it common function that will be called by all solo command

Copilot AI commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

@copilot the pre-flight check should not just for relay node, maybe more general for all solo command, considering move it common function that will be called by all solo command

Done in commit refactor: move dockerDesktopPreflightTask to BaseCommand for all commands.

dockerDesktopPreflightTask() is now a protected method on BaseCommand (with its imports moved there too), so every Solo command class that extends BaseCommand can include this.dockerDesktopPreflightTask() in its task list. The RelayCommand no longer defines the method itself — it simply calls it via inheritance.

@JeffreyDallas JeffreyDallas left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot
update PR description with current changes

@jan-milenkov jan-milenkov added the PR: Checks Failed A pull request where the checks have failed. label Jun 18, 2026
Comment thread src/integration/kube/k8-client/resources/pod/k8-client-pods.ts
Comment thread src/commands/init/init.ts
Comment thread src/integration/kube/k8-client/resources/pod/k8-client-pods.ts Outdated
Comment thread src/integration/kube/k8-client/resources/pod/k8-client-pods.ts Outdated
Comment thread src/integration/kube/k8-client/resources/pod/k8-client-pods.ts Outdated
Comment thread src/commands/base.ts Outdated
Comment thread src/commands/base.ts Outdated
@jeromy-cannon jeromy-cannon added PR: Unresolved Comments A pull request where there are comments and they need to be resolved. and removed PR: Needs Team Approval A pull request that needs review from a team member. labels Jun 18, 2026
@jeromy-cannon jeromy-cannon marked this pull request as draft June 18, 2026 17:30
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas added PR: Needs Team Approval A pull request that needs review from a team member. and removed PR: Unresolved Comments A pull request where there are comments and they need to be resolved. labels Jun 18, 2026
@JeffreyDallas JeffreyDallas marked this pull request as ready for review June 18, 2026 18:26
@JeffreyDallas JeffreyDallas removed the PR: Checks Failed A pull request where the checks have failed. label Jun 18, 2026
@jeromy-cannon jeromy-cannon added the P1-💎 Current Milestone & Goals label Jun 19, 2026
@jeromy-cannon jeromy-cannon marked this pull request as draft June 19, 2026 15:48
@jeromy-cannon jeromy-cannon added PR: Merge Conflicts A pull request that has merge conflicts that need to be resolved. and removed PR: Needs Team Approval A pull request that needs review from a team member. labels Jun 19, 2026
…fix-redeploy-failure-relation

Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>

# Conflicts:
#	src/integration/kube/k8-client/resources/pod/k8-client-pods.ts
#	test/unit/integration/kube/detect-fatal-container-error.test.ts
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas removed the PR: Merge Conflicts A pull request that has merge conflicts that need to be resolved. label Jun 22, 2026
@JeffreyDallas JeffreyDallas marked this pull request as ready for review June 22, 2026 20:29
@JeffreyDallas

Copy link
Copy Markdown
Contributor

Wait #4798 to fix migration test

@JeffreyDallas JeffreyDallas added Blocked Further development work is blocked by other item and removed PR: Merging... labels Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Blocked Further development work is blocked by other item P1-💎 Current Milestone & Goals

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Relay deploy fails with `containerd.sock: connection refused

4 participants