
fix: add GitHubApiClient with GITHUB_TOKEN auth and retry on 403/429 #4712

Merged
jeromy-cannon merged 14 commits into main from fix/github-api-rate-limit-retry
Jun 22, 2026

Conversation

@JeffreyDallas JeffreyDallas commented Jun 17, 2026

Contributor

Description

This pull request changes the following:

  • Adds GitHubApiClient (src/core/github-api-client.ts) with GITHUB_TOKEN auth and automatic retry on 403/429 responses, fixing GitHub API rate-limit failures in CI dependency managers
  • Fixes Windows one-shot CI: adds setServiceEndpoints (ClusterIP) to the existing NodeUpdateTransaction in setGrpcWebEndpoint, so the mirror node address book carries a routable IP instead of the bootstrap FQDN. hiero-sdk-go v2.80.0 introduced eager gRPC dialing at ClientForNetwork startup; on Windows Kind/WSL2 the FQDN TCP dial hangs ~13 min (kernel retransmit timeout), blocking pinger readiness. After the NodeUpdate, the importer writes the ClusterIP to file 0.0.102 and pinger connects immediately on its next restart.
  • Increases mirror pinger pod readiness timeout from 15 → 30 minutes (MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS: 450 → 900 × 2 s, env-var overridable) to accommodate image-load overhead on Windows runners introduced by ebe4534e1
  • Adds separate pinger-specific readiness constants (MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS, MIRROR_NODE_PINGER_PODS_READY_DELAY) so pinger wait can be tuned independently of other pod readiness checks
  • Increases NodesStarted event wait timeout to 30 minutes (NODES_STARTED_EVENT_TIMEOUT_MINUTES, env-var overridable) to fix relay timeout in one-shot deploy
  • Increases MirrorNodeDeployed event wait timeout to 10 minutes (MIRROR_NODE_DEPLOYED_EVENT_TIMEOUT_MINUTES, env-var overridable)
  • Increases node-add-local and separate-node-add E2E test timeouts to fix intermittent CI timeout failures
  • Fixes one-shot-local-build example: skips helm dependency build entirely when chart tarballs are already cached, avoiding Docker Hub unauthenticated 429 rate-limit errors on cache-hit runs
  • Adds docker.io/bitnami/postgresql:latest to the solo image cache target list to prevent ImagePullBackOff on Windows

Related Issues

Pull request (PR) checklist

  • This PR added tests (unit, integration, and/or end-to-end)
  • This PR updated documentation
  • This PR added no TODOs or commented out code
  • This PR has no breaking changes
  • Any technical debt has been documented as a separate issue and linked to this PR
  • Any package.json changes have been explained to and approved by a repository manager
  • All related issues have been linked to this PR
  • All changes in this PR are included in the description
  • When this PR merges the commits will be squashed and the title will be used as the commit message, the 'commit message guidelines' below have been followed

Testing

  • This PR added unit tests
  • This PR added integration/end-to-end tests
  • These changes required manual testing that is documented below
  • Anything not tested is documented

The following manual testing was done:

  • Unit tests for GitHubApiClient (9 tests: auth header injection, 403/429 retry with backoff, 404 passthrough, non-auth requests)
  • task build passes cleanly (0 errors)

The following was not tested locally (relies on CI):

  • Windows pinger fix: requires Windows Kind/WSL2 environment to reproduce the 13-minute FQDN TCP hang
  • one-shot-local-build helm cache skip on cache-miss path (first run still contacts Docker Hub)

Intermittent HTTP 403 errors from the GitHub API on Windows runners were
caused by two compounding issues:

1. EdgeVersionFetcher never included an Authorization header, leaving all
   five component-version lookups unauthenticated (60 req/hour shared per
   IP on GitHub-hosted runners).
2. No retry logic existed for transient 403/429 rate-limit responses, so
   a single bad response permanently failed the dependency-check step.

Introduce GitHubApiClient, a static utility class that:
- Adds a Bearer token from GITHUB_TOKEN when present (raises limit to
  5 000 req/hour for authenticated requests).
- Retries on HTTP 403 and 429 with exponential backoff (up to 3 attempts),
  honouring the Retry-After and X-RateLimit-Reset response headers.
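
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/core/github-api-client.ts: the names `buildHeaders`, `retryDelayMs`, `getWithRetry`, `HttpResponse` and `FetchLike` are hypothetical, and the base delay, the delay cap, and returning the last response once attempts are exhausted are assumptions that the commit message does not state.

```typescript
// Minimal response/fetch shapes so the sketch does not depend on a specific runtime.
interface HttpResponse {
  status: number;
  headers: { get(name: string): string | null };
}
type FetchLike = (url: string, init: { headers: Record<string, string> }) => Promise<HttpResponse>;

const MAX_ATTEMPTS = 3; // from the commit message: up to 3 attempts
const BASE_DELAY_MS = 1_000; // assumption: the real base delay is not stated
const MAX_DELAY_MS = 60_000; // assumption: cap so a distant reset time cannot stall CI

// Adds a Bearer token only when one is configured.
function buildHeaders(token: string | undefined): Record<string, string> {
  const headers: Record<string, string> = { Accept: 'application/vnd.github+json' };
  if (token) headers['Authorization'] = `Bearer ${token}`;
  return headers;
}

// Prefers Retry-After (seconds), then X-RateLimit-Reset (epoch seconds),
// and falls back to exponential backoff when neither header is usable.
function retryDelayMs(response: HttpResponse, attempt: number, nowMs: number): number {
  const retryAfter = Number(response.headers.get('retry-after'));
  if (Number.isFinite(retryAfter) && retryAfter > 0) {
    return Math.min(retryAfter * 1000, MAX_DELAY_MS);
  }
  const reset = Number(response.headers.get('x-ratelimit-reset'));
  if (Number.isFinite(reset) && reset > 0) {
    return Math.min(Math.max(reset * 1000 - nowMs, 0), MAX_DELAY_MS);
  }
  return Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
}

// Retries only on 403/429; any other status is returned immediately.
async function getWithRetry(
  url: string,
  fetchFn: FetchLike,
  token: string | undefined,
  sleep: (ms: number) => Promise<void>,
): Promise<HttpResponse> {
  for (let attempt = 1; ; attempt++) {
    const response = await fetchFn(url, { headers: buildHeaders(token) });
    const rateLimited = response.status === 403 || response.status === 429;
    if (!rateLimited || attempt >= MAX_ATTEMPTS) return response;
    await sleep(retryDelayMs(response, attempt, Date.now()));
  }
}
```

Passing the fetch function and the sleep function as parameters keeps the sketch testable without network access or real waiting; a caller would typically supply the runtime's fetch and a setTimeout-based sleep.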

Wire GitHubApiClient.get() into EdgeVersionFetcher and all four dependency
managers (crane, gvproxy, podman, vfkit), replacing ~15 lines of duplicated
header-building + fetch code in each.

Add 9 unit tests for GitHubApiClient covering auth, retry, Retry-After
header parsing, exhaustion after max retries, no-retry on non-rate-limit
errors, and network-failure wrapping.

Fixes #4711

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas requested a review from a team as a code owner June 17, 2026 18:23

trunk-io Bot commented Jun 17, 2026

😎 Merged manually by @jeromy-cannon - details.

@JeffreyDallas JeffreyDallas self-assigned this Jun 17, 2026
@JeffreyDallas JeffreyDallas added the P1-💎 Current Milestone & Goals and PR: Needs Team Approval labels Jun 17, 2026

github-actions Bot commented Jun 17, 2026

Contributor

Unit Test Results - Linux

38 tests  ±0   38 ✅ ±0   0s ⏱️ ±0s
17 suites ±0    0 💤 ±0 
 1 files   ±0    0 ❌ ±0 

Results for commit aadb62d. ± Comparison against base commit 1c0d381.

♻️ This comment has been updated with latest results.

github-actions Bot commented Jun 17, 2026

Contributor

Unit Test Results - Windows

    1 files  ±0    334 suites  +2   9s ⏱️ -1s
1 049 tests +9  1 049 ✅ +9  0 💤 ±0  0 ❌ ±0 
1 053 runs  +9  1 053 ✅ +9  0 💤 ±0  0 ❌ ±0 

Results for commit aadb62d. ± Comparison against base commit 1c0d381.

♻️ This comment has been updated with latest results.

JeffreyDallas and others added 2 commits June 17, 2026 14:09
On Windows runners using WSL2, the mirror-node pinger pod takes longer
to become ready than the default 300×2 s = 10 minutes because it must:
1. pass its own startup probe (/tmp/alive file),
2. connect to the Mirror REST API (http://mirror-1-restjava:80),
3. submit a transaction through the consensus network, and
4. verify that transaction was ingested by the mirror importer.

All of these steps are slower under WSL2 due to network-layer indirection,
and the generic PODS_READY_MAX_ATTEMPTS budget (shared with every other
pod check) is often exhausted just as the pinger is about to go Ready.

Add MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS (default 450) and
MIRROR_NODE_PINGER_PODS_READY_DELAY (default 2 000 ms), giving pinger
checks a 15-minute budget — consistent with relay and block-node —
while leaving every other mirror-node pod check at the existing 10-minute
limit.  Both constants are overridable via environment variables.
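
As a rough illustration of how such env-var overridable constants translate into a time budget, the sketch below reads both values with their stated defaults and computes the total wait. The helper names `envInt` and `pingerReadinessBudgetMinutes` are hypothetical, and the environment is passed in as a plain object rather than read from the process, so this is not the project's actual constants module.

```typescript
type Env = Record<string, string | undefined>;

// Reads a positive integer from the environment, falling back to the
// default when the variable is unset or not a valid positive integer.
function envInt(env: Env, name: string, fallback: number): number {
  const parsed = Number.parseInt(env[name] ?? '', 10);
  return Number.isInteger(parsed) && parsed > 0 ? parsed : fallback;
}

// Total readiness budget in minutes: attempts x delay between attempts.
function pingerReadinessBudgetMinutes(env: Env): number {
  const maxAttempts = envInt(env, 'MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS', 450);
  const delayMs = envInt(env, 'MIRROR_NODE_PINGER_PODS_READY_DELAY', 2_000);
  return (maxAttempts * delayMs) / 60_000;
}

console.log(pingerReadinessBudgetMinutes({})); // 15
console.log(pingerReadinessBudgetMinutes({ MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS: '900' })); // 30
```

With the defaults, 450 attempts × 2 000 ms gives 900 000 ms, or 15 minutes; overriding the attempt count to 900 doubles the budget to 30 minutes without touching any other pod check.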

Observed failure: https://github.com/hiero-ledger/solo/actions/runs/27705538841/job/81956787144?pr=4703

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
The relay's withWaitCondition for NodesStarted was set to 10 minutes,
but the node start sequence emits NodesStarted only after the full chain
completes (including waitForTss), which can take 15-25+ minutes on slow
or busy runners. Both timeouts are now env-var overridable constants:
NODES_STARTED_EVENT_TIMEOUT_MINUTES (default 30) and
MIRROR_NODE_DEPLOYED_EVENT_TIMEOUT_MINUTES (default 10).
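
The general idea of bounding an event wait by a configurable number of minutes can be sketched as below. This is not the `withWaitCondition` implementation, whose signature is not shown here; `withTimeoutMinutes` is a hypothetical helper that only illustrates the pattern of racing a pending event against a timer.

```typescript
// Default stated in the commit message for the NodesStarted wait.
const NODES_STARTED_EVENT_TIMEOUT_MINUTES = 30;

// Rejects if the wrapped promise does not settle within the given number
// of minutes; otherwise passes its result (or its error) through unchanged.
function withTimeoutMinutes<T>(work: Promise<T>, minutes: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`${label} not observed within ${minutes} minute(s)`)),
      minutes * 60_000,
    );
    work.then(
      value => {
        clearTimeout(timer); // stop the timer so it cannot fire later
        resolve(value);
      },
      error => {
        clearTimeout(timer);
        reject(error);
      },
    );
  });
}
```

A caller would wrap whatever promise resolves when the event arrives, for example `withTimeoutMinutes(nodesStarted, NODES_STARTED_EVENT_TIMEOUT_MINUTES, 'NodesStarted')`, where `nodesStarted` stands for that promise. If the timeout is shorter than the slowest realistic start sequence (15-25+ minutes here), the wait fails even though the event would eventually have been emitted, which is the failure mode this commit addresses.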

Fixes #4714

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>

github-actions Bot commented Jun 17, 2026

Contributor

E2E Test Report

 10 files  ±0   94 suites  ±0   1h 23m 58s ⏱️ +9s
301 tests ±0  301 ✅ ±0  0 💤 ±0  0 ❌ ±0 
320 runs  ±0  320 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit aadb62d. ± Comparison against base commit 1c0d381.

♻️ This comment has been updated with latest results.

JeffreyDallas and others added 3 commits June 17, 2026 15:21
The describe block in separate-node-add.test.ts had a 3-minute default
timeout that matched the intermittent failure reported in #4715 (Mocha
applies the describe-level timeout when the describe callback is async).
The "should add a new node to the network successfully" test runs three
sequential prepare/submit/execute commands that can take 10-15+ minutes
on slow CI runners, but only had a 12-minute timeout.

- describe block default: 3 min → 20 min
- "should add a new node" test: 12 min → 20 min
- outer describe in node-add-local: 3 min → 30 min

Fixes #4715

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Windows/WSL2 runners need more time for the mirror pinger pod to become
ready during concurrent one-shot deploys. The previous 15-minute limit
(450 attempts × 2 000 ms) was insufficient, so it is bumped to 900
attempts (30 minutes).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
… 429

When helm dependency build runs for OCI chart repos (registry-1.docker.io/
bitnamicharts), it contacts Docker Hub for manifest verification even when
the tarball already exists in charts/, triggering unauthenticated rate-limit
429 errors. Fix: only run helm dependency build on cache miss; on cache hit,
restore tarballs and skip the build entirely. The cache key is the version.ts
hash, so the same key guarantees identical chart versions and tarballs.

Fixes #4721

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas requested a review from a team as a code owner June 18, 2026 02:32
Comment thread src/core/github-api-client.ts Outdated
@jan-milenkov jan-milenkov added the PR: Unresolved Comments label Jun 18, 2026
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas removed the PR: Unresolved Comments label Jun 18, 2026
JeffreyDallas and others added 5 commits June 18, 2026 10:00
…lBackOff

On Windows/WSL2 runners, the solo-shared-resources postgresql pod fails with
ImagePullBackOff because bitnami/postgresql:latest is not pre-cached into
the Kind cluster. This causes a cascade: PostgreSQL never starts → mirror
REST health check fails → mirror pinger can never become ready. Adding the
image to solo-cache-images-target.yaml ensures it is pre-loaded before
the deploy, avoiding Docker Hub unauthenticated rate-limit 429s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
…record stall

Without a block node, CN defaults to FILE_AND_GRPC writerMode which fills the
gRPC buffer (maxBlocks=5) and stalls record file production after ~20s. Mirror
importer falls behind, pinger can never confirm transactions, and the one-shot
deploy times out. Fix profile-manager to explicitly set FILE_ONLY when no block
nodes are in the deployment state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
… node

FILE_ONLY is not a valid BlockStreamWriterMode enum value in CN v0.74; the
correct value is FILE. Using FILE_ONLY caused CN to fail to become ACTIVE
across all E2E tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Reverts 242d1cc and f9e70fd. The writerMode changes are not needed
for the Windows pinger fix and are being removed.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
jan-milenkov
jan-milenkov previously approved these changes Jun 19, 2026
jeromy-cannon
jeromy-cannon previously approved these changes Jun 19, 2026
jeromy-cannon and others added 2 commits June 19, 2026 16:51
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
@JeffreyDallas JeffreyDallas dismissed stale reviews from jeromy-cannon and jan-milenkov via aadb62d June 19, 2026 16:15
@JeffreyDallas JeffreyDallas added P0-🔥 ASAP and removed P1-💎 Current Milestone & Goals labels Jun 19, 2026
@jan-milenkov jan-milenkov added PR: Ready to Merge and removed PR: Needs Team Approval labels Jun 22, 2026
@jeromy-cannon jeromy-cannon merged commit 1f63fc5 into main Jun 22, 2026
56 of 58 checks passed
@jeromy-cannon jeromy-cannon deleted the fix/github-api-rate-limit-retry branch June 22, 2026 11:00
swirlds-automation added a commit that referenced this pull request Jun 23, 2026
## [0.79.0](v0.78.0...v0.79.0) (2026-06-23)

### Features

* disable minio for CN >= 0.74.0 ([#4511](#4511)) ([e8a8c90](e8a8c90))

### Bug Fixes

* add GitHubApiClient with GITHUB_TOKEN auth and retry on 403/429 ([#4712](#4712)) ([1f63fc5](1f63fc5))
* delay one-shot mirror pinger deployment ([#4762](#4762)) ([5d787e3](5d787e3))
* generate error docs in solo and upload as release artifact ([#4750](#4750)) ([526ee3a](526ee3a))
* **lock:** treat suspended holders as lost so re-runs can reclaim ([#4663](#4663)) ([68289cf](68289cf))
* lower block memory footprint & fix migration from CN. 0.73 to CN 0.74 ([#4678](#4678)) ([de23795](de23795))
* mount small-memory patches directory from SOLO_CACHE staging ([#4756](#4756)) ([e02a97f](e02a97f))
* rework account creation idempotency guard ([#4728](#4728)) ([1439ef3](1439ef3))
* tolerate Helm OCI status output ([#4652](#4652)) ([733d052](733d052))
* update TSS wraps artifacts path to data/keys subdirectory ([#4662](#4662)) ([41e9377](41e9377))
@swirlds-automation

Contributor

🎉 This PR is included in version 0.79.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀


Labels

P0-🔥 ASAP, PR: Ready to Merge, released

Projects

None yet

4 participants