docker: add blob-level mirror fallback in dockerImageSource by QiWang19 · Pull Request #845 · podman-container-tools/container-libs

QiWang19 · 2026-05-14T03:46:03Z

Summary

In disconnected/restricted environments, a mirror can pass the manifest
check in newImageSource() but then fail blob downloads (5xx, timeout,
or partial mirror returning BLOB_UNKNOWN). When this happens, the copy
fails and CRI-O propagates the error to kubelet — which retries from
scratch, picks the same broken mirror, and deadlocks.

This PR adds blob-level mirror fallback inside dockerImageSource so
that when a blob fetch fails, remaining configured mirrors are tried
before giving up. The fallback is transparent to all callers (CRI-O,
Podman, Buildah, Skopeo).

Design choices

Fallback triggers: Only transient errors (5xx, network timeout),
blob-not-found (BLOB_UNKNOWN), and rate limiting (429) trigger
fallback. Other 4xx errors (auth failures, etc.) are fatal — they
indicate a configuration problem, not a source issue.

Serialized mirror probing: getBlobWithMirrorFallback() holds
mirrorMu for the entire probe loop. This is intentional: the first
goroutine to hit a failure walks the mirror list while the rest block on
getActiveSource(), then reuse the discovered working mirror via
mirrorOverride. This avoids N goroutines independently exhausting the
mirror list.

No manifest re-fetch on fallback: Blob digests from the original
manifest are used against fallback sources. Safe because digests are
content-addressed. Fallback uses newPullClient() to create only a
client and reference, skipping the manifest fetch; configured mirrors
are expected to have the image in practice, and isMirrorFallbackError
handles BLOB_UNKNOWN to move to the next source if it doesn't.

remainingSources includes mirrors AND the primary location: The
fallback loop tries whatever pullSources entries come after the
initially-selected source, which may include the original upstream
registry.

Same-mirror retry on transient errors: tryGetBlob() retries once
with 1s+jitter delay before moving to the next source. This handles
brief 503 spikes without immediately triggering the full fallback
machinery.

Questions

Re: #845 (review)

Are the right errors triggering fallback in isMirrorTransientError() / isMirrorFallbackError()? Should any other error types be included or excluded?
The full-loop lock in getBlobWithMirrorFallback() serializes all blob goroutines during fallback — is this acceptable, or would you prefer a different concurrency strategy?
Should tryGetBlob() retry once on the same mirror before falling back, or should any failure immediately try the next source?

Fixes: https://issues.redhat.com/browse/OCPSTRAT-3170

mtrmac

Thanks!

Doing this at the copy.Image level is very surprising, and harder than necessary — and less resilient, because it still assumes that a full image copy can proceed uninterrupted from a single mirror.

I’d expect dockerImageSource, instead, to do the fallbacks — making physicalRef and the dockerClient uses mutable (careful about HasThreadSafeGetBlob: true!), or perhaps maintaining an on-demand-extended set of dockerClients, in the worst case one per mirror; then every GetBlob/… on the dockerImageSource could fall back to another mirror (and we can have endless tuning discussions about the tuning logic there)…

Either way the generic copy code, AFAICS, does not need to know.

mtrmac · 2026-05-14T14:34:24Z

+	// If not nil, DockerExcludedPullSources lists endpoint Location that NewImageSource skips when selecting a pull source.
+	DockerExcludedPullSources map[string]struct{}
+	// If DockerExcludedPullSources is not nil, the Location of the pull source selected by the most recent NewImageSource call.
+	DockerLastSelectedPullSource string


This is conceptually an internal state of dockerImageSource and I see no reason to make these public and store them so far from there. In the very worst case, we could use a new private interface for that (compare #531 adding NewImageSourceWithOptions) but even that seems suboptimal, see elsewhere.

mtrmac · 2026-05-14T14:36:05Z

 	if err != nil {
 		return nil, fmt.Errorf("initializing source %s: %w", transports.ImageName(srcRef), err)
 	}
-	rawSource := imagesource.FromPublic(publicRawSource)


The idea here is that we convert from types.ImageSource to private.ImageSource at the earliest opportunity, and then we never need to worry about public-only implementations (so much).

mtrmac · 2026-05-14T14:37:40Z

 	signersToClose                []*signer.Signer    // Signers that should be closed when this copier is destroyed.
 }

+type pinnedManifestSource struct {


This seems unnecessary in principle … dockerImageSource already has a cachedManifest (but it does require maintain that state carefully).

mtrmac · 2026-05-14T14:49:17Z

BTW the non-timeout test failures look real at a first glance.

QiWang19 · 2026-05-20T07:14:36Z

@mtrmac I updated this PR blob-level mirror fallback. I'd appreciate an early review on the approach before I continue. And there are a few design questions I'd like your input on(in the PR description under "Questions.")

mtrmac

Thanks!

I’m afraid there is way too much going on (e.g. we need to migrate of Cirrus) and I’ll be only able to pay proper attention to this, probably, weeks later.

For now just a very minimal look: Yes, I’d expect the state to be in dockerImageSource ~only. I didn’t carefully read the logic.

Are the right errors triggering fallback in isMirrorTransientError() / isMirrorFallbackError()?

Looks plausible; I’m sure we will be turning the heuristics 10 years from now… Anyway, thanks for clearly splitting the logic into separate documented functions, that will help future maintenance.

Should any other error types be included or excluded?

I think bodyReader already handles another set of errors. [There is a question whether bodyReader should also trigger a fallback to a different mirror… I think that would better be a different PR, and it might not even be necessary — OTOH there it might be worth just thinking a bit about what the code structure to enable that would look like.]

The full-loop lock in getBlobWithMirrorFallback() serializes all blob goroutines during fallback — is this acceptable, or would you prefer a different concurrency strategy?

Without looking at the details, I think this is a great idea.

Should tryGetBlob() retry once on the same mirror before falling back, or should any failure immediately try the next source?

Hum… in general, it’s problematic to have retry loops in lower levels of code; if each level of a call stack has a retry loop, that makes retry behavior exponential in the depth of the call stack. So as a general principle, I’ve been told that retries should exist only at the top level if possible. (Which is why GetBlob/GetManifest has no retries itself; callers can use c/common/pkg/retry at a higher level. OTOH bodyReader exists because the retry behavior in that case is usefully different from retrying from scratch.)

So I think GetBlob should ideally not retry at all if no mirrors are involved. If we do have mirrors, I don’t have a strong opinion. Given how costly / disruptive / potentially unexpected / (likely to fail in many configuration?) a fallback to another mirror can be, I think one more retry before starting the mirror fallback can well make sense.

mtrmac · 2026-05-20T17:04:51Z

+	return s.getBlob(ctx, info, cache)
+}
+
+func (s *dockerImageSource) getBlob(ctx context.Context, info types.BlobInfo, cache types.BlobInfoCache) (io.ReadCloser, int64, error) {


Is there any other caller than GetBlob itself?

no other caller except 'GetBlob'. Make getBlob inlined.

mtrmac · 2026-05-20T17:10:54Z

+		// newImageSourceAttempt calls ensureManifestIsLoaded(), which preserves
+		// the manifest-level filtering: only mirrors that can serve the manifest


Probably not?! If we got go the getBlob stage, cachedManifest is already set and ensure… does nothing.

Removed ensureManifestIsLoaded call in the fallnack loop.
ensureManifestIsLoaded was used to verify the fallback mirror had the image before trying blob fetches. Changed to just try the blob directly since most mirrors are expected to have the images in practice.

When a blob fetch fails with a transient error (5xx, timeout), a blob-not-found (BLOB_UNKNOWN), or rate limiting (429), try remaining configured sources before giving up. The source probe is serialized under mirrorMu: the first goroutine to hit a failure walks the source list while the rest block on getActiveSource(), then reuse the discovered working source. Retry on the same source before fallback is only performed when alternative sources are configured, to avoid adding lower-level retries when callers already use c/common/pkg/retry. Signed-off-by: Qi Wang <qiwan@redhat.com>

QiWang19 · 2026-06-10T03:36:31Z

@mtrmac PR is ready for review. PTAL

github-actions Bot added the image Related to "image" package label May 14, 2026

mtrmac requested changes May 14, 2026

View reviewed changes

QiWang19 force-pushed the fallback-mirror branch from 246c4b9 to 71dc1f5 Compare May 20, 2026 07:05

QiWang19 changed the title ~~copy: Fall back to the next configured mirror on copy failure~~ docker: add blob-level mirror fallback in dockerImageSource May 20, 2026

mtrmac reviewed May 20, 2026

View reviewed changes

QiWang19 force-pushed the fallback-mirror branch 2 times, most recently from e587a16 to 804b8c7 Compare June 9, 2026 21:14

QiWang19 force-pushed the fallback-mirror branch from 4346e3c to b2461ab Compare June 9, 2026 21:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docker: add blob-level mirror fallback in dockerImageSource#845

docker: add blob-level mirror fallback in dockerImageSource#845
QiWang19 wants to merge 1 commit into
podman-container-tools:mainfrom
QiWang19:fallback-mirror

QiWang19 commented May 14, 2026 •

edited

Loading

Uh oh!

mtrmac left a comment

Uh oh!

mtrmac May 14, 2026

Uh oh!

mtrmac May 14, 2026

Uh oh!

mtrmac May 14, 2026

Uh oh!

mtrmac commented May 14, 2026

Uh oh!

QiWang19 commented May 20, 2026

Uh oh!

mtrmac left a comment

Uh oh!

mtrmac May 20, 2026

Uh oh!

QiWang19 Jun 9, 2026

Uh oh!

mtrmac May 20, 2026

Uh oh!

QiWang19 Jun 9, 2026

Uh oh!

QiWang19 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		// newImageSourceAttempt calls ensureManifestIsLoaded(), which preserves
		// the manifest-level filtering: only mirrors that can serve the manifest

Conversation

QiWang19 commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Design choices

Questions

Uh oh!

mtrmac left a comment

Choose a reason for hiding this comment

Uh oh!

mtrmac May 14, 2026

Choose a reason for hiding this comment

Uh oh!

mtrmac May 14, 2026

Choose a reason for hiding this comment

Uh oh!

mtrmac May 14, 2026

Choose a reason for hiding this comment

Uh oh!

mtrmac commented May 14, 2026

Uh oh!

QiWang19 commented May 20, 2026

Uh oh!

mtrmac left a comment

Choose a reason for hiding this comment

Uh oh!

mtrmac May 20, 2026

Choose a reason for hiding this comment

Uh oh!

QiWang19 Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

mtrmac May 20, 2026

Choose a reason for hiding this comment

Uh oh!

QiWang19 Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

QiWang19 commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

QiWang19 commented May 14, 2026 •

edited

Loading