Skip to content

docker: add blob-level mirror fallback in dockerImageSource#845

Open
QiWang19 wants to merge 1 commit into
podman-container-tools:mainfrom
QiWang19:fallback-mirror
Open

docker: add blob-level mirror fallback in dockerImageSource#845
QiWang19 wants to merge 1 commit into
podman-container-tools:mainfrom
QiWang19:fallback-mirror

Conversation

@QiWang19

@QiWang19 QiWang19 commented May 14, 2026

Copy link
Copy Markdown
Contributor

Summary

In disconnected/restricted environments, a mirror can pass the manifest
check in newImageSource() but then fail blob downloads (5xx, timeout,
or partial mirror returning BLOB_UNKNOWN). When this happens, the copy
fails and CRI-O propagates the error to kubelet — which retries from
scratch, picks the same broken mirror, and deadlocks.

This PR adds blob-level mirror fallback inside dockerImageSource so
that when a blob fetch fails, remaining configured mirrors are tried
before giving up. The fallback is transparent to all callers (CRI-O,
Podman, Buildah, Skopeo).

Design choices

Fallback triggers: Only transient errors (5xx, network timeout),
blob-not-found (BLOB_UNKNOWN), and rate limiting (429) trigger
fallback. Other 4xx errors (auth failures, etc.) are fatal — they
indicate a configuration problem, not a source issue.

Serialized mirror probing: getBlobWithMirrorFallback() holds
mirrorMu for the entire probe loop. This is intentional: the first
goroutine to hit a failure walks the mirror list while the rest block on
getActiveSource(), then reuse the discovered working mirror via
mirrorOverride. This avoids N goroutines independently exhausting the
mirror list.

No manifest re-fetch on fallback: Blob digests from the original
manifest are used against fallback sources. Safe because digests are
content-addressed. Fallback uses newPullClient() to create only a
client and reference, skipping the manifest fetch; configured mirrors
are expected to have the image in practice, and isMirrorFallbackError
handles BLOB_UNKNOWN to move to the next source if it doesn't.

remainingSources includes mirrors AND the primary location: The
fallback loop tries whatever pullSources entries come after the
initially-selected source, which may include the original upstream
registry.

Same-mirror retry on transient errors: tryGetBlob() retries once
with 1s+jitter delay before moving to the next source. This handles
brief 503 spikes without immediately triggering the full fallback
machinery.

Questions

Re: #845 (review)

  • Are the right errors triggering fallback in isMirrorTransientError() / isMirrorFallbackError()? Should any other error types be included or excluded?
  • The full-loop lock in getBlobWithMirrorFallback() serializes all blob goroutines during fallback — is this acceptable, or would you prefer a different concurrency strategy?
  • Should tryGetBlob() retry once on the same mirror before falling back, or should any failure immediately try the next source?

Fixes: https://issues.redhat.com/browse/OCPSTRAT-3170

@github-actions github-actions Bot added the image Related to "image" package label May 14, 2026

@mtrmac mtrmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Doing this at the copy.Image level is very surprising, and harder than necessary — and less resilient, because it still assumes that a full image copy can proceed uninterrupted from a single mirror.

I’d expect dockerImageSource, instead, to do the fallbacks — making physicalRef and the dockerClient uses mutable (careful about HasThreadSafeGetBlob: true!), or perhaps maintaining an on-demand-extended set of dockerClients, in the worst case one per mirror; then every GetBlob/… on the dockerImageSource could fall back to another mirror (and we can have endless tuning discussions about the tuning logic there)…

Either way the generic copy code, AFAICS, does not need to know.

Comment thread image/types/types.go Outdated
Comment on lines +682 to +685
// If not nil, DockerExcludedPullSources lists endpoint Location that NewImageSource skips when selecting a pull source.
DockerExcludedPullSources map[string]struct{}
// If DockerExcludedPullSources is not nil, the Location of the pull source selected by the most recent NewImageSource call.
DockerLastSelectedPullSource string

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is conceptually an internal state of dockerImageSource and I see no reason to make these public and store them so far from there. In the very worst case, we could use a new private interface for that (compare #531 adding NewImageSourceWithOptions) but even that seems suboptimal, see elsewhere.

Comment thread image/copy/copy.go
if err != nil {
return nil, fmt.Errorf("initializing source %s: %w", transports.ImageName(srcRef), err)
}
rawSource := imagesource.FromPublic(publicRawSource)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea here is that we convert from types.ImageSource to private.ImageSource at the earliest opportunity, and then we never need to worry about public-only implementations (so much).

Comment thread image/copy/copy.go Outdated
signersToClose []*signer.Signer // Signers that should be closed when this copier is destroyed.
}

type pinnedManifestSource struct {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems unnecessary in principle … dockerImageSource already has a cachedManifest (but it does require maintain that state carefully).

@mtrmac

mtrmac commented May 14, 2026

Copy link
Copy Markdown
Contributor

BTW the non-timeout test failures look real at a first glance.

@QiWang19 QiWang19 changed the title copy: Fall back to the next configured mirror on copy failure docker: add blob-level mirror fallback in dockerImageSource May 20, 2026
@QiWang19

Copy link
Copy Markdown
Contributor Author

@mtrmac I updated this PR blob-level mirror fallback. I'd appreciate an early review on the approach before I continue. And there are a few design questions I'd like your input on(in the PR description under "Questions.")

@mtrmac mtrmac left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

I’m afraid there is way too much going on (e.g. we need to migrate of Cirrus) and I’ll be only able to pay proper attention to this, probably, weeks later.

For now just a very minimal look: Yes, I’d expect the state to be in dockerImageSource ~only. I didn’t carefully read the logic.

Are the right errors triggering fallback in isMirrorTransientError() / isMirrorFallbackError()?

Looks plausible; I’m sure we will be turning the heuristics 10 years from now… Anyway, thanks for clearly splitting the logic into separate documented functions, that will help future maintenance.

Should any other error types be included or excluded?

I think bodyReader already handles another set of errors. [There is a question whether bodyReader should also trigger a fallback to a different mirror… I think that would better be a different PR, and it might not even be necessary — OTOH there it might be worth just thinking a bit about what the code structure to enable that would look like.]

The full-loop lock in getBlobWithMirrorFallback() serializes all blob goroutines during fallback — is this acceptable, or would you prefer a different concurrency strategy?

Without looking at the details, I think this is a great idea.

Should tryGetBlob() retry once on the same mirror before falling back, or should any failure immediately try the next source?

Hum… in general, it’s problematic to have retry loops in lower levels of code; if each level of a call stack has a retry loop, that makes retry behavior exponential in the depth of the call stack. So as a general principle, I’ve been told that retries should exist only at the top level if possible. (Which is why GetBlob/GetManifest has no retries itself; callers can use c/common/pkg/retry at a higher level. OTOH bodyReader exists because the retry behavior in that case is usefully different from retrying from scratch.)

So I think GetBlob should ideally not retry at all if no mirrors are involved. If we do have mirrors, I don’t have a strong opinion. Given how costly / disruptive / potentially unexpected / (likely to fail in many configuration?) a fallback to another mirror can be, I think one more retry before starting the mirror fallback can well make sense.

Comment thread image/docker/docker_image_src.go Outdated
return s.getBlob(ctx, info, cache)
}

func (s *dockerImageSource) getBlob(ctx context.Context, info types.BlobInfo, cache types.BlobInfoCache) (io.ReadCloser, int64, error) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any other caller than GetBlob itself?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no other caller except 'GetBlob'. Make getBlob inlined.

Comment thread image/docker/docker_image_src.go Outdated
Comment on lines +610 to +611
// newImageSourceAttempt calls ensureManifestIsLoaded(), which preserves
// the manifest-level filtering: only mirrors that can serve the manifest

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not?! If we got go the getBlob stage, cachedManifest is already set and ensure… does nothing.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed ensureManifestIsLoaded call in the fallnack loop.
ensureManifestIsLoaded was used to verify the fallback mirror had the image before trying blob fetches. Changed to just try the blob directly since most mirrors are expected to have the images in practice.

@QiWang19 QiWang19 force-pushed the fallback-mirror branch 2 times, most recently from e587a16 to 804b8c7 Compare June 9, 2026 21:14
When a blob fetch fails with a transient error (5xx, timeout),
a blob-not-found (BLOB_UNKNOWN), or rate limiting (429), try
remaining configured sources before giving up.

The source probe is serialized under mirrorMu: the first goroutine
to hit a failure walks the source list while the rest block on
getActiveSource(), then reuse the discovered working source.

Retry on the same source before fallback is only performed when
alternative sources are configured, to avoid adding lower-level
retries when callers already use c/common/pkg/retry.

Signed-off-by: Qi Wang <qiwan@redhat.com>
@QiWang19

Copy link
Copy Markdown
Contributor Author

@mtrmac PR is ready for review. PTAL

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

image Related to "image" package

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants