update to latest buildkit #442

Open
gilescope wants to merge 182 commits into main from giles-update-latest-buildkit

@gilescope gilescope commented Apr 10, 2026

Makes ubuntu-latest CI reliably green by fixing the actual flake sources, and points buildkitd at a diagnostics-enabled BuildKit fork revision so the next unknown failure names its root cause instead of context canceled.

Companion fork PR: EarthBuild/buildkit#14

Merge these PRs first to reduce diff size

Each is an independent, self-contained extraction that lands on current main without the bump:

Once the above land, this PR is reduced to the genuine bump: go.mod/go.sum (buildkit/containerd v2/grpc 1.80), buildkitd, the BuildKit-API adaptations (entitlements, protobuf getters, ALPN), and the OOM/retry CI tuning.

Flake classes fixed

  1. Network-fetch flake (dominant). +base fetched the Go toolchain with a bare wget; dl.google.com drops connections mid-transfer, +base dies, and every dependent target reports context canceled, which made BuildKit look guilty. Downloads now resume (wget -c) and retry with backoff; curl call sites get --retry --retry-all-errors. The same treatment applies to zig (ziglang.org throttles CI), gh, kind, antlr, and golangci-lint. (Split out as ci: retry/back off toolchain downloads #577.)
  2. Data race. Package-level shared cases.Caser in util/stringutil (x/text Casers are stateful, not goroutine-safe) panicked -race jobs with slice bounds out of range. Now constructed per call; concurrent regression test added. (Split out as fix: data race squashed #567.)
  3. Nested-build cancellation. BuildKit fork revision 85c7359 preserves first non-cancellation root causes across cancellation fan-out (exec/cache/gateway/session paths), so genuine BuildKit failures surface with target/command context. Workflow retry harness restarts buildkitd between attempts. (earth-side diagnostics split out as feat: surface BuildKit root-cause on cancellation #574.)

Also in this branch

  • CI memory telemetry, swap headroom, serialized slow nested test groups, staged buildkitd reuse.
  • Exit-code-126 explanations and edge-merge subbuild fixes in the fork.

Verification

  • go test -race ./util/stringutil/ red before the caser fix, green after (×3 runs).
  • Retry-loop control flow tested for first-try success / eventual success / total failure; busybox wget flag support verified against alpine:3.18.

Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
@gilescope gilescope requested a review from a team as a code owner April 10, 2026 13:16
@gilescope gilescope requested review from kmannislands and removed request for a team April 10, 2026 13:16
github-actions Bot commented Apr 10, 2026

⚠️ Are we earthbuild yet?

Warning: "earthly" occurrences have increased by 186 (3.48%)

📈 Overall Progress

Branch | Total Count
main | 5346
This PR | 5532
Difference | +186 (3.48%)

📁 Changes by file type:

File Type | Change
Go files (.go) | ❌ +6
Documentation (.md) | ❌ +14
Earthfiles | ➖ No change

Keep up the great work migrating from Earthly to Earthbuild! 🚀

💡 Tips for finding more occurrences

Run locally to see detailed breakdown:

./.github/scripts/count-earthly.sh

Note that the goal is not to reach 0.
At least some occurrences of earthly are expected to remain in the source code for backwards compatibility with config files and language constructs.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates repository references and URLs from the 'earthly' organization to 'EarthBuild' across documentation and build configurations. It also performs a significant update of Go dependencies, including upgrading gRPC to v1.80.0 and updating various containerd and Docker-related packages. A bug was identified in the buildkitd/Earthfile where a log message references an undefined variable ${BUILDKIT_BRANCH} instead of ${BUILDKIT_GIT_BRANCH}.

Comment thread buildkitd/Earthfile
  echo "looking up branch $BUILDKIT_GIT_BRANCH"; \
  buildkit_sha1=$(git ls-remote --refs -q https://github.com/$BUILDKIT_GIT_ORG/buildkit.git "$BUILDKIT_GIT_BRANCH" | awk 'BEGIN { FS = "[ \t]+" } {print $1}'); \
- echo "pinning github.com/earthly/buildkit@${BUILDKIT_BRANCH} to reference git sha1: $buildkit_sha1"; \
+ echo "pinning github.com/${BUILDKIT_GIT_ORG}/buildkit@${BUILDKIT_BRANCH} to reference git sha1: $buildkit_sha1"; \

medium

The variable ${BUILDKIT_BRANCH} is used in this log message, but the argument defined in this scope is BUILDKIT_GIT_BRANCH. This will result in an empty string being printed for the branch name in the logs. Since you are already modifying this line to parameterize the organization, you should also correct the branch variable name.

            echo "pinning github.com/${BUILDKIT_GIT_ORG}/buildkit@${BUILDKIT_GIT_BRANCH} to reference git sha1: $buildkit_sha1"; \

Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Alpine 3.22 moved iptables from /sbin to /usr/sbin.

Signed-off-by: Giles Cope <gilescope@gmail.com>
- Update buildkit fork ref to f4ec24bc (includes GRPC_ENFORCE_ALPN_ENABLED=false)
- Disable gRPC ALPN enforcement in earthly client and buildkitd entrypoint
  for backwards compat with older grpc-go during upgrade transition
- Search /usr/sbin in addition to /sbin for iptables (Alpine 3.22 change)
- Bump all CI EARTHLY_BUILDKIT_IMAGE refs to v0.8.17-fix.4

Signed-off-by: Giles Cope <gilescope@gmail.com>
@gilescope gilescope force-pushed the giles-update-latest-buildkit branch from 990ef27 to c42959c Compare April 11, 2026 05:45
Older earth binaries pass EARTHLY_ADDITIONAL_BUILDKIT_CONFIG with the
TOML section header and key on the same line (e.g.
[registry."docker.io"] mirrors = [...]).  The new buildkit's TOML
parser requires a newline after section headers.  Post-process the
generated buildkitd.toml to split these.
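
The post-processing step described in this commit message can be sketched as a line-oriented rewrite. The actual fix runs in the buildkitd entrypoint; the Go function `splitInlineSections` and its regular expression are assumptions for illustration, and the pattern deliberately ignores edge cases such as a `]` inside a quoted table name.

```go
package main

import (
	"fmt"
	"regexp"
)

// headerWithKey matches a line that starts with a TOML table header
// ([name] or [[name]]) followed on the same line by more content that is
// not a comment. Group 1 is the header, group 2 is the trailing content.
var headerWithKey = regexp.MustCompile(`(?m)^([ \t]*\[\[?[^\]\n]+\]\]?)[ \t]+([^\s#].*)$`)

// splitInlineSections moves anything that follows a table header onto its
// own line, leaving all other lines untouched.
func splitInlineSections(toml string) string {
	return headerWithKey.ReplaceAllString(toml, "$1\n$2")
}

func main() {
	in := `[registry."docker.io"] mirrors = ["mirror.gcr.io"]
[worker.oci]
max-parallelism = 2
`
	fmt.Print(splitInlineSections(in))
}
```

Lines that merely contain brackets, such as array values, are not affected because the pattern is anchored to a header at the start of the line.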

Signed-off-by: Giles Cope <gilescope@gmail.com>
Upstream buildkit added a strict verifier that rejects multiple refs
without platform mapping.  Earthly's multi-BUILD pattern legitimately
produces this.  The buildkit fork now downgrades this to a warning.

Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
@gilescope (Author)

21 green CI jobs! It's a start.

Signed-off-by: Giles Cope <gilescope@gmail.com>
Three concurrent Go compilations (buildkitd, ticktock-buildkitd,
earthly) exceed the 16GB runner memory under Podman, causing OOM
kills that manifest as silent cancellations.

Signed-off-by: Giles Cope <gilescope@gmail.com>
The old SHA 88ecf5d6 is incompatible with the current codebase:
missing client/llb/sourceresolver package and containerd API
version conflicts.  Point to the same buildkit fork commit
(da92d3419) used by the main build.

Signed-off-by: Giles Cope <gilescope@gmail.com>
Three concurrent Go compilations (buildkitd, buildctl, earthly) exceed
runner memory under Podman.  Limiting max-parallelism to 2 serialises
the heaviest compilation steps, keeping peak memory within bounds.

Signed-off-by: Giles Cope <gilescope@gmail.com>
The build-earthly parallelism fix only applied to the build step.
Test jobs bootstrap their own buildkitd via stage2-setup, so they
also need the parallelism limit to avoid OOM during Go compilations.

Signed-off-by: Giles Cope <gilescope@gmail.com>
The earthly-next build is even heavier than normal (update-buildkit +
two buildkitd variants + earthly). Apply max-parallelism=2 to prevent
OOM on standard CI runners.

Signed-off-by: Giles Cope <gilescope@gmail.com>
Move GCR mirror config and max-parallelism before bootstrap so
buildkitd starts with the correct settings first time, avoiding
a restart that may not pick up max-parallelism correctly.

Signed-off-by: Giles Cope <gilescope@gmail.com>
Docker earthly-next tests also OOM with default parallelism of 20.
Apply the limit for all CI builds, not just Podman/earthly-next.

Signed-off-by: Giles Cope <gilescope@gmail.com>
gilescope and others added 14 commits June 11, 2026 08:19
builder/solver.go ran two errgroup goroutines: bkClient.Build and
MonitorProgress. MonitorProgress also returns errors from earth's own
status processing (e.g. bp.NewCommand). When it aborts, it cancels the
shared errgroup context, so bkClient.Build then returns a bare
'context canceled' — and the old code preferred that buildErr,
discarding the real monitor error in eg.Wait()'s result. An earth-side
self-cancellation was thus misreported as 'BuildKit lost the session'.

chooseSolveError now prefers a non-cancellation monitor error over a
canceled build error. Doubles as a probe: a class-3 failure caused this
way will now name its real cause instead of the cancellation veil.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
not-a-unit-test.sh ran 'go test' with the default -p (GOMAXPROCS = host
CPUs), compiling and linking many test binaries at once. Nested inside
an earthly build on a 4-core/16G CI runner, that RSS spike is what tips
the box into memory pressure; the kill then cascades as a lost solve
session at this exact vertex. Cap -p to 2 (override via
GO_TEST_PARALLELISM) to flatten the peak.
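
The cap-with-override logic reads as follows when sketched in Go. The real change is in the shell script not-a-unit-test.sh; the helper names `testParallelism` and `goTestArgs` are hypothetical, and only the `GO_TEST_PARALLELISM` variable and the default of 2 come from the commit message.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultTestParallelism is the cap applied when no override is set.
const defaultTestParallelism = 2

// testParallelism returns the value of GO_TEST_PARALLELISM when it is a
// positive integer, and the default cap otherwise.
func testParallelism(getenv func(string) string) int {
	n, err := strconv.Atoi(getenv("GO_TEST_PARALLELISM"))
	if err != nil || n < 1 {
		return defaultTestParallelism
	}
	return n
}

// goTestArgs builds the argument list for "go test -p N <pkgs...>".
func goTestArgs(p int, pkgs ...string) []string {
	return append([]string{"test", "-p", strconv.Itoa(p)}, pkgs...)
}

func main() {
	fmt.Println(goTestArgs(testParallelism(os.Getenv), "./..."))
}
```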

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…uildkit

Adopts main's base->go/node Earthfile refactor (FROM golang:1.26.4-alpine3.24
eliminates the Go-tarball wget that was flake class #1), keeping the
EarthBuild/buildkit diagnostics pin (79762ff4c), the WITH-DOCKER nested test
wrapper plus EARTHLY_SKIP_BUILDKIT_CLI_TESTS, and the Docker Hub login
continue-on-error. Action versions taken from main.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Earthfile merge resolver dropped the newline between
'RUN apk add ... git' and 'ENV EARTHLY_IMAGE=true', joining them onto one
line. That swallowed the ENV into the RUN command and corrupted
+earthly-docker, cascading 'requires a FROM' failures through every
target that builds the inner earthly via +earthly-integration-test-base.
Split the lines back (keeping main's --no-cache apk flag). The
IF-before-FROM in the integration base is valid and unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
main's base->go refactor removed the file-level 'FROM alpine' base recipe.
That base set RanFromLike=true for every target, which the interpreter's
checkAllowed guard requires before an IF (the condition runs in a shell).
Two targets rely on an IF before their first FROM to choose a buildkit
image conditionally: +earthly-docker and +earthly-integration-test-base.
Without the implicit base they now fail 'requires a FROM', cascading
through every test that builds the inner earthly.

Add a scaffold 'FROM alpine:3.24.0' to each (replaced by the in-IF FROM),
mirroring buildkitd/Earthfile which kept its file-level FROM. Verified
against converter.go checkAllowed: a prior FROM sets RanFromLike and
unblocks the IF.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Resolves go.mod fsutil require (the active version is the EarthBuild
fsutil replace directive regardless). Carries the IF-before-FROM scaffold
fix so the A+B probe can finally run clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ons)

Root cause of the recurring 'BuildKit canceled or lost the solve session'
failures, finally surfaced by the errgroup attribution fix
(chooseSolveError): the build aborts with

    earth progress monitor aborted the build: failed decoding stats
    stream: unexpected stats stream protocol version 123

123 is 0x7B, '{': the daemon's runc stats collector hits EOF (the
recurring 'runc stats collection error: EOF' in buildkitd logs) and
emits a raw/partial frame where the versioned framing
([0x01][uint32 len][JSON]) is expected. vertexMonitor.Write returned
that decode error, which propagates through MonitorProgress, cancels the
errgroup, kills the running exec (exit 137), and reports a bogus lost
session.

Stats are diagnostic telemetry and must never abort a build. Drop the
bad batch and re-sync (new Parser.Reset) instead. Red test reproduces a
raw '{' frame and the recovery.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
build-earthly bootstraps with the released earth (v0.8.17), which lacks
the stats-stream non-fatal fix, so it can still hit a class-3 'Canceled'
when driving the fork's buildkitd. It also occasionally hits a transient
exit-126 exec failure in the go build. With only 2 attempts a run can
exhaust both on different flakes (seen on 00bd4b0: attempt 1 Canceled,
attempt 2 exit 126), skipping the whole downstream suite. A third
attempt makes the bootstrap absorb these non-deterministic failures.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
main bumped the moby/buildkit *require* to v0.30.0 (#563); kept that line
but preserved our replace directive pinning the EarthBuild diagnostics
fork (79762ff4c, the stats/cancellation work this branch depends on) and
the docker-image-spec require. go.sum regenerated via go mod tidy. Also
brings docker/cli v29.5.3, alpine 3.24.0, npm 11.17.0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
Signed-off-by: Giles Cope <gilescope@gmail.com>
… hooks

Signed-off-by: Giles Cope <gilescope@gmail.com>
Labels

ai-assisted Authored with AI assistance