feat(releases): retry fetchReleases on transient failures with backoff #111

Closed
tmchow wants to merge 1 commit into opentofu:main from
tmchow:osc/110-fetchreleases-retry-backoff
Conversation

@tmchow

@tmchow tmchow commented Apr 16, 2026

Summary

fetchReleases() in lib/releases.js made a single GET to https://get.opentofu.org/tofu/api.json and threw on any failure -- network error, HTTP 429, or any 5xx. Action runs would fail whenever the releases endpoint hiccuped. Wrap the fetch in an exponential-backoff retry loop so transient failures self-heal. Addresses #110.

Why

Per the issue thread, @diofeher marked this accepted and said:

The "rate-limiting" part is indeed true. While interacting with this API, we don't handle it. I'm going to put this issue as accepted so retry and back-off can be implemented.

Today the fetch path has three failure modes, all treated as permanent:

  1. http.get(...) throws (network reset, DNS, TLS) -> rethrown with no retry.
  2. Response with non-200 status -> rethrown with no retry.
  3. resp.readBody() throws mid-stream -> rethrown with no retry.

On shared CI, all three are frequently transient. A few retries with jittered backoff would absorb the vast majority of them without changing steady-state behavior for callers.

Changes

Only lib/releases.js grows logic; interfaces are unchanged.

  • New private helpers:
    • _sleep(ms) -- promisified setTimeout.
    • _backoffDelayMs(attempt) -- exponential (500ms, 1s, 2s, ...) capped at 8s with up to 250ms jitter so concurrent runners don't resync on retry boundaries.
  • fetchReleases now loops up to maxAttempts (default 4):
    • Network-level throws from http.get -> sleep + retry until exhausted.
    • 200 response -> return as before.
    • 408 / 429 / 500 / 502 / 503 / 504 -> sleep + retry.
    • Other 4xx -> fail fast (unchanged).
    • readBody throws -> sleep + retry (treated as transient stream failure).
    • JSON.parse throws -> fail fast (retrying won't help malformed JSON).
  • maxAttempts, sleepFn, and backoffFn are options on the second argument and default to the production values. Tests inject a no-op sleepFn to keep the suite instant.
  • fetchReleases added to the named exports so tests can drive it directly.
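
For illustration, the two helpers described above could look roughly like the sketch below. This is not the code from the PR: the constant names, the injectable randomFn parameter, and the choice to add jitter on top of the capped delay are all assumptions made for the example.

```javascript
// Hypothetical sketch of the backoff helpers; constants mirror the PR text,
// but the names and the exact jitter placement are assumptions.
const BASE_DELAY_MS = 500;
const MAX_DELAY_MS = 8000;
const MAX_JITTER_MS = 250;

// attempt is zero-based: 0 -> 500ms, 1 -> 1s, 2 -> 2s, ... capped at 8s.
// randomFn is injectable (an assumption) so the jitter can be pinned in tests.
function backoffDelayMs(attempt, randomFn = Math.random) {
  const exponential = BASE_DELAY_MS * 2 ** attempt;
  const capped = Math.min(exponential, MAX_DELAY_MS);
  // Jitter of 0..249ms keeps concurrent runners from retrying in lockstep.
  const jitter = Math.floor(randomFn() * MAX_JITTER_MS);
  return capped + jitter;
}

// Promisified setTimeout, as described for _sleep(ms).
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

With the jitter pinned to zero, attempts 0, 1, and 2 yield 500, 1000, and 2000 ms, and every attempt from 4 onward yields the 8000 ms cap.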

The retry path is gated on status codes that the HTTP spec and GitHub's guidance call out as transient. Every other 4xx still fails on the first attempt; callers see the same error as before.
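
A minimal sketch of that gating follows, assuming a hypothetical getFn that resolves to an object with a status field and a readBody() method. The function name fetchJsonWithRetry, the permanent flag on errors, and the jitter-free default backoffFn are illustrative choices, not the PR's actual implementation.

```javascript
// Hypothetical sketch of the retry gating; not the code from the PR.
const RETRYABLE_STATUS = new Set([408, 429, 500, 502, 503, 504]);

async function fetchJsonWithRetry(getFn, options = {}) {
  const {
    maxAttempts = 4,
    sleepFn = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
    backoffFn = (attempt) => Math.min(500 * 2 ** attempt, 8000), // jitter omitted
  } = options;

  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    let body;
    try {
      const resp = await getFn(); // a network-level throw lands in catch
      if (resp.status !== 200) {
        const err = new Error(`unexpected HTTP status ${resp.status}`);
        // Any status outside the retryable set is treated as permanent.
        err.permanent = !RETRYABLE_STATUS.has(resp.status);
        throw err;
      }
      body = await resp.readBody(); // a mid-stream throw is treated as transient
    } catch (err) {
      if (err.permanent) throw err; // fail fast, no sleep
      lastError = err;
      // Sleep only between attempts, never after the final one.
      if (attempt < maxAttempts - 1) await sleepFn(backoffFn(attempt));
      continue;
    }
    // Parsing happens outside the try block so malformed JSON fails fast.
    return JSON.parse(body);
  }
  throw lastError;
}
```

Keeping JSON.parse outside the try block is what separates "retry the transport" from "fail on bad content" in this sketch; a different structure could reach the same behavior.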

Testing

New lib/test/releases-fetch.test.js covers:

  • Success on first attempt -> no sleep calls.
  • 503 -> 503 -> 200 -> succeeds, sleeps twice with expected attempt values (0, 1).
  • 429 (rate-limit) -> 200 -> succeeds.
  • Network error thrown from http.get -> retry -> success.
  • 404 -> fails fast with no sleep (permanent).
  • Malformed JSON -> fails fast with no sleep (permanent).
  • maxAttempts=3 exhausted on three 503s -> throws last error, sleeps 2 times (between attempts 1-2 and 2-3).
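
Cases like these depend on two test doubles: a scripted HTTP getter and a sleep function that records instead of waiting. The sketch below shows one way to build them; makeScriptedGet and makeRecordingSleep are hypothetical names and are not taken from the PR's test file.

```javascript
// Hypothetical test doubles for driving retry scenarios deterministically.

// Returns an async getter that replays `steps` in order. An Error entry is
// thrown (simulating a network failure); any other entry becomes a response.
// The last step repeats if the getter is called more times than scripted.
function makeScriptedGet(steps) {
  let calls = 0;
  const get = async () => {
    const step = steps[Math.min(calls, steps.length - 1)];
    calls += 1;
    if (step instanceof Error) throw step;
    return { status: step.status, readBody: async () => step.body ?? '' };
  };
  get.callCount = () => calls;
  return get;
}

// Returns a no-op sleep that records each requested delay, so a test can
// assert how many sleeps happened without waiting for real time to pass.
function makeRecordingSleep() {
  const delays = [];
  const sleepFn = async (ms) => { delays.push(ms); };
  sleepFn.delays = delays;
  return sleepFn;
}
```

A "503 -> 503 -> 200" case then reduces to scripting three steps, passing the recording sleep as sleepFn, and asserting that two delays were recorded.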

Existing suites are untouched: npm test reports 5 passed, 5 total / 41 tests. npm run build produces the dist/ bundle, which is committed alongside the source change because this action ships prebuilt.

Fixes #110

@tmchow tmchow requested a review from a team as a code owner April 16, 2026 11:04
fetchReleases() calls get.opentofu.org/tofu/api.json once and throws on
any failure -- network error, 429 rate-limit, or 5xx. The issue (opentofu#110)
reports that this surfaces as flaky action runs whenever the API hiccups.

Wrap the fetch in an exponential-backoff retry loop (500ms, 1s, 2s,
capped at 8s, with up to 250ms jitter to avoid concurrent runners
re-syncing). Retry on:
  - Network-level errors thrown from http.get
  - HTTP 408 / 429 / 500 / 502 / 503 / 504
Fail fast on permanent errors (other 4xx) and on malformed JSON --
retrying wouldn't help those.

The options object (maxAttempts, sleepFn, backoffFn) is test-only so
the retry harness can drive deterministic scenarios without real sleeps.
Default behavior matches the maintainer's "accepted" direction in the
issue thread.

Fixes opentofu#110

Signed-off-by: Trevin Chow <trevin@trevinchow.com>
@tmchow tmchow force-pushed the osc/110-fetchreleases-retry-backoff branch from d1083de to 7013747 Compare April 16, 2026 11:30
@diofeher
Member

diofeher commented Apr 16, 2026

Thanks for your contribution @tmchow, but per our policy, we don't accept LLM-generated PRs: https://github.com/opentofu/opentofu/blob/main/contributing/DEVELOPING.md#a-note-on-copyright

@diofeher diofeher closed this Apr 16, 2026
Development

Successfully merging this pull request may close these issues.

Does not handle retry/backoff on fetchReleases