Skip to content

Add 5xx retry policy for transient server errors#26821

Open
andrewmathew1 wants to merge 3 commits into
Azure:mainfrom
andrewmathew1:andrewmathew1/fix-5xx-retry-behavior
Open

Add 5xx retry policy for transient server errors#26821
andrewmathew1 wants to merge 3 commits into
Azure:mainfrom
andrewmathew1:andrewmathew1/fix-5xx-retry-behavior

Conversation

@andrewmathew1
Copy link
Copy Markdown

Description

Adds a retry policy in the clientRetryPolicy for 500, 502, and 504 responses. Only read operations are retried; writes are surfaced to the caller immediately.

The retry budget is one in-region retry followed by one cross-region retry. The cross-region retry only fires when cross-region retries are enabled and a preferred location is available to fail over to. After both attempts are exhausted (or skipped), the original 5xx response is returned wrapped as a non-retriable error, matching the pattern already used for 503 / 404 / 403 in this file.

Fixes #25639.

Adds a retry policy for 500, 502, and 504 responses in the azcosmos
client retry pipeline. Only read operations are retried, which is
consistent with the .NET, Java, and Python Cosmos SDKs.

The retry budget is one in-region retry followed by one cross-region
retry. The cross-region retry only fires when cross-region retries are
enabled and a preferred location is available to fail over to.

Fixes Azure#25639.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 20, 2026 03:11
@andrewmathew1 andrewmathew1 requested a review from a team as a code owner May 20, 2026 03:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds transient 5xx retry handling to the Cosmos client retry policy so read requests can recover from certain server-side failures while ensuring write requests are not retried.

Changes:

  • Added retry handling for HTTP 500, 502, and 504 in clientRetryPolicy, with an in-region retry followed by an optional cross-region retry.
  • Introduced serverErrorRetryCount to track the 5xx retry budget independently of other retry counters.
  • Added unit tests for read/write behavior on these 5xx responses and documented the change in the package changelog.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
sdk/data/azcosmos/cosmos_client_retry_policy.go Implements 5xx retry logic and tracking in the client retry policy.
sdk/data/azcosmos/cosmos_client_retry_policy_test.go Adds tests validating read retries and ensuring writes are not retried for 5xx.
sdk/data/azcosmos/CHANGELOG.md Notes the new 5xx retry behavior in the unreleased changelog.

Comment thread sdk/data/azcosmos/cosmos_client_retry_policy.go
andrewmathew1 and others added 2 commits May 19, 2026 23:26
Adds tests covering scenarios the original PR did not exercise:

* In-region retry succeeds without a cross-region attempt.
* Cross-region retries disabled - only the in-region retry runs.
* Exhausted retries surface `*azcore.ResponseError` with the
  original status code.
* 501 (Not Implemented) is not retried.
* 500 interleaved with 503 retries correctly across regions.

Together with the existing tests this brings
`attemptRetryOnServerError` and `shouldRetryStatus` to 100%
statement coverage.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Addresses review feedback on the 5xx retry policy. Previously the
cross-region retry was gated only on `enableCrossRegionRetries` and
the length of `preferredLocations`, so single-region accounts (or
configurations where the only resolved read endpoint matches the
current endpoint) would burn the cross-region retry on the same URL
without actually failing over.

The cross-region retry now also requires more than one resolved read
endpoint in the location cache. The mock location-cache helper used in
tests is updated to populate `readEndpoints` / `writeEndpoints` per
available location so existing cross-region retry tests continue to
exercise the failover path, and a new test pins the single-endpoint
behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] 5xx retry policy

2 participants