build(compat): make Dockerfile.local go mod download resilient to proxy.golang.org HTTP/2 flakes #709

Merged: tsouza merged 1 commit into main from fix/loki-compat-godl-proxy-flake on May 22, 2026

@tsouza (Owner) commented May 22, 2026

Summary

PR #708's compatibility/loki job (run 26306912141, job 77445902857)
failed before the harness ever ran:

#15 22.88 go: github.com/grpc-ecosystem/grpc-gateway/v2@v2.29.0: read
"https://proxy.golang.org/.../grpc-gateway/v2/@v/v2.29.0.zip":
stream error: stream ID 2015; INTERNAL_ERROR; received from peer
#15 ERROR: process "/bin/sh -c go mod download" did not complete successfully: exit code: 1

This is not a seed-settle failure (prior fixes #66 / #123 / #136 covered
that). The cerberus image build inside the compat compose stack tripped
on a transient proxy.golang.org HTTP/2 stream error during
RUN go mod download in Dockerfile.local. The Go module resolver
does not retry past a bad HTTP/2 frame, so a single stream error fails
the whole compat job.

All three compatibility harnesses (prom/loki/tempo) build cerberus
from this same Dockerfile via their docker-compose.yml. The fix is
structural and applies to every head — loki was just the unlucky one
this run.

Fix

Two changes to Dockerfile.local:

  1. Retry loop around go mod download: 5 attempts with linear
    backoff (3/6/9/12s between attempts). The failure surfaces only if
    all 5 attempts fail.
  2. BuildKit cache mounts for /go/pkg/mod and
    /root/.cache/go-build (sharing=locked because the three compat
    harnesses build this Dockerfile in parallel on the same runner).
    On a warm runner, modules already in the cache are not fetched
    from the proxy again, so a future flake can only hit a cold first
    build or a newly added dependency.
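The actual Dockerfile.local diff is not reproduced on this page, so the following is only a minimal sketch of what a 5-attempt retry with linear backoff could look like at the shell layer. The function name `retry_linear` and its argument order are hypothetical, chosen for illustration; the real change may well inline the loop directly in the RUN step.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): run a command up to
# <max_attempts> times, sleeping attempt * <base_delay> seconds between
# failed attempts. With 5 attempts and a base delay of 3 this gives the
# 3/6/9/12s schedule described above.
#
# usage: retry_linear <max_attempts> <base_delay_seconds> <command> [args...]
retry_linear() {
    max_attempts=$1
    base_delay=$2
    shift 2

    attempt=1
    while :; do
        # Success on any attempt ends the loop immediately.
        "$@" && return 0

        # No sleep after the final attempt; report the failure instead.
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi

        delay=$((attempt * base_delay))
        echo "attempt $attempt failed; retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
    done
}
```

Under these assumptions the download step would be invoked as `retry_linear 5 3 go mod download`. Keeping the loop in the shell, rather than relying on the Go toolchain, matches the PR's point that the retry has to sit at the layer where the network flake occurs.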

Mandate compliance: no timeout bump, no rerun-and-pray. The retry is
inside the build at the network layer (where the flake actually is),
not inside the harness at the seed-settle layer.

Test plan

  • compatibility/loki green
  • compatibility/prometheus green (same Dockerfile — confirms no regression)
  • compatibility/tempo green (ditto)
  • compose-smoke green (also builds Dockerfile.local indirectly)

…xy.golang.org HTTP/2 flakes

The three compatibility harnesses (prom/loki/tempo) all build cerberus
from Dockerfile.local on every CI run. The `RUN go mod download` step
has no retry logic and no module cache mount, so a single transient
`proxy.golang.org` HTTP/2 `stream error ... INTERNAL_ERROR; received
from peer` mid-stream takes the whole compat job down with it.

Observed on PR #708 / run 26306912141, compatibility/loki job
77445902857: `go: github.com/grpc-ecosystem/grpc-gateway/v2@v2.29.0:
read "https://proxy.golang.org/.../v2.29.0.zip": stream error;
INTERNAL_ERROR; received from peer`. The mandate is no-retry-rerun —
fix the underlying fragility instead of bandaiding.

Two structural changes to Dockerfile.local:

1. Wrap `go mod download` in a 5-attempt retry loop with linear
   backoff (3/6/9/12s). The Go module resolver does not retry past a
   bad HTTP/2 frame, so the wrapper is needed at the shell layer.
2. Add BuildKit `--mount=type=cache` for /go/pkg/mod and
   /root/.cache/go-build (sharing=locked because the three compat
   harnesses build this Dockerfile in parallel on the same runner).
   With a warm cache, subsequent builds no longer fetch cached modules
   from the proxy, so the proxy hit surface narrows to first-build
   only.

This is a fix to a flake class, not a single point; the same outage
would have hit prom or tempo if the unlucky frame had landed there
first.
@tsouza tsouza enabled auto-merge (squash) May 22, 2026 19:18
@tsouza tsouza merged commit 5824531 into main May 22, 2026
21 checks passed
@tsouza tsouza deleted the fix/loki-compat-godl-proxy-flake branch May 22, 2026 19:26