fix: add GitHubApiClient with GITHUB_TOKEN auth and retry on 403/429 #4712
Merged
Intermittent HTTP 403 errors from the GitHub API on Windows runners were caused by two compounding issues:

1. EdgeVersionFetcher never included an Authorization header, leaving all five component-version lookups unauthenticated (60 req/hour shared per IP on GitHub-hosted runners).
2. No retry logic existed for transient 403/429 rate-limit responses, so a single bad response permanently failed the dependency-check step.

Introduce GitHubApiClient, a static utility class that:

- Adds a Bearer token from GITHUB_TOKEN when present (raises limit to 5,000 req/hour for authenticated requests).
- Retries on HTTP 403 and 429 with exponential backoff (up to 3 attempts), honouring the Retry-After and X-RateLimit-Reset response headers.

Wire GitHubApiClient.get() into EdgeVersionFetcher and all four dependency managers (crane, gvproxy, podman, vfkit), replacing ~15 lines of duplicated header-building + fetch code in each.

Add 9 unit tests for GitHubApiClient covering auth, retry, Retry-After header parsing, exhaustion after max retries, no-retry on non-rate-limit errors, and network-failure wrapping.

Fixes #4711

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
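The auth and retry behaviour described in this commit message could look roughly like the sketch below. It is not the actual `GitHubApiClient` code: the function names, the injected `fetchFn`/`sleep` parameters, the 1-second base delay, the `Accept` header, and returning (rather than throwing on) the last response after exhaustion are all assumptions made for illustration.

```typescript
// Illustrative sketch only; names and defaults are assumptions,
// not the actual GitHubApiClient implementation.
interface MinimalResponse {
  status: number;
  headers: {get(name: string): string | null};
}
type FetchLike = (url: string, init: {headers: Record<string, string>}) => Promise<MinimalResponse>;
type Env = Record<string, string | undefined>;

const MAX_ATTEMPTS = 3; // "up to 3 attempts" per the commit message
const BASE_DELAY_MS = 1000; // assumed base for exponential backoff

// Add a Bearer token only when GITHUB_TOKEN is present in the environment.
function buildHeaders(env: Env): Record<string, string> {
  const headers: Record<string, string> = {Accept: 'application/vnd.github+json'};
  const token = env['GITHUB_TOKEN'];
  if (token) {
    headers['Authorization'] = `Bearer ${token}`;
  }
  return headers;
}

// Prefer Retry-After (seconds), then X-RateLimit-Reset (epoch seconds),
// and fall back to exponential backoff when neither header is usable.
function retryDelayMs(getHeader: (name: string) => string | null, attempt: number, nowMs: number): number {
  const retryAfter = getHeader('retry-after');
  if (retryAfter !== null && /^\d+$/.test(retryAfter.trim())) {
    return Number(retryAfter) * 1000;
  }
  const reset = getHeader('x-ratelimit-reset');
  if (reset !== null && /^\d+$/.test(reset.trim())) {
    return Math.max(0, Number(reset) * 1000 - nowMs);
  }
  return BASE_DELAY_MS * Math.pow(2, attempt);
}

// Retry only on 403/429; any other status is returned to the caller immediately.
async function getWithRetry(
  url: string,
  fetchFn: FetchLike,
  env: Env,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<MinimalResponse> {
  const headers = buildHeaders(env);
  for (let attempt = 0; ; attempt++) {
    const response = await fetchFn(url, {headers});
    const rateLimited = response.status === 403 || response.status === 429;
    if (!rateLimited || attempt === MAX_ATTEMPTS - 1) {
      return response; // assumption: the real client may throw on exhaustion instead
    }
    await sleep(retryDelayMs((name) => response.headers.get(name), attempt, Date.now()));
  }
}
```

In a Node.js caller, `env` would typically be `process.env` and `fetchFn` a thin wrapper around the global `fetch`; injecting both keeps the retry logic testable without network access.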
😎 Merged manually by @jeromy-cannon.
On Windows runners using WSL2, the mirror-node pinger pod takes longer to become ready than the default 300 × 2 s = 10 minutes because it must:

1. pass its own startup probe (/tmp/alive file),
2. connect to the Mirror REST API (http://mirror-1-restjava:80),
3. submit a transaction through the consensus network, and
4. verify that transaction was ingested by the mirror importer.

All of these steps are slower under WSL2 due to network-layer indirection, and the generic PODS_READY_MAX_ATTEMPTS budget (shared with every other pod check) is often exhausted just as the pinger is about to go Ready.

Add MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS (default 450) and MIRROR_NODE_PINGER_PODS_READY_DELAY (default 2,000 ms), giving pinger checks a 15-minute budget (consistent with relay and block-node) while leaving every other mirror-node pod check at the existing 10-minute limit. Both constants are overridable via environment variables.

Observed failure: https://github.com/hiero-ledger/solo/actions/runs/27705538841/job/81956787144?pr=4703

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
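The budget arithmetic above (attempts × delay) and the env-var override can be sketched as follows. The helper names `envInt` and `pingerReadyBudget` are hypothetical, and how the real constants module parses and validates its environment variables is an assumption; only the two variable names and their defaults come from the commit message.

```typescript
// Illustrative sketch; envInt and pingerReadyBudget are hypothetical helpers.
type Env = Record<string, string | undefined>;

// Read a positive integer from the environment, falling back to a default
// when the variable is unset or not a usable number.
function envInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  const parsed = raw === undefined ? NaN : Number.parseInt(raw, 10);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallback;
}

// Total wait budget = max attempts × delay between attempts.
function pingerReadyBudget(env: Env): {maxAttempts: number; delayMs: number; budgetMinutes: number} {
  const maxAttempts = envInt(env, 'MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS', 450);
  const delayMs = envInt(env, 'MIRROR_NODE_PINGER_PODS_READY_DELAY', 2000);
  return {maxAttempts, delayMs, budgetMinutes: (maxAttempts * delayMs) / 60000};
}
```

With the defaults this gives 450 × 2,000 ms = 15 minutes; setting `MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS=900` doubles the budget to 30 minutes without touching the other pod checks.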
The relay's withWaitCondition for NodesStarted was set to 10 minutes, but the node start sequence emits NodesStarted only after the full chain completes (including waitForTss), which can take 15-25+ minutes on slow or busy runners.

Both timeouts are now env-var overridable constants: NODES_STARTED_EVENT_TIMEOUT_MINUTES (default 30) and MIRROR_NODE_DEPLOYED_EVENT_TIMEOUT_MINUTES (default 10).

Fixes #4714

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
The describe block in separate-node-add.test.ts had a 3-minute default timeout that matched the intermittent failure reported in #4715 (Mocha applies the describe-level timeout when the describe callback is async). The "should add a new node to the network successfully" test runs three sequential prepare/submit/execute commands that can take 10-15+ minutes on slow CI runners, but only had a 12-minute timeout.

- describe block default: 3 min → 20 min
- "should add a new node" test: 12 min → 20 min
- outer describe in node-add-local: 3 min → 30 min

Fixes #4715

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
Windows/WSL2 runners need more time for the mirror pinger pod to become ready during concurrent one-shot deploys. The previous 15-minute limit (450 attempts × 2,000 ms) was insufficient; bumped to 900 attempts = 30 min.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
… 429

When helm dependency build runs for OCI chart repos (registry-1.docker.io/bitnamicharts), it contacts Docker Hub for manifest verification even when the tarball already exists in charts/, triggering unauthenticated rate-limit 429 errors.

Fix: only run helm dependency build on cache miss; on cache hit, restore tarballs and skip the build entirely. The cache key is the version.ts hash, so the same key guarantees identical chart versions and tarballs.

Fixes #4721

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
…lBackOff

On Windows/WSL2 runners, the solo-shared-resources postgresql pod fails with ImagePullBackOff because bitnami/postgresql:latest is not pre-cached into the Kind cluster. This causes a cascade: PostgreSQL never starts → mirror REST health check fails → mirror pinger can never become ready.

Adding the image to solo-cache-images-target.yaml ensures it is pre-loaded before the deploy, avoiding Docker Hub unauthenticated rate-limit 429s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
…record stall

Without a block node, CN defaults to FILE_AND_GRPC writerMode, which fills the gRPC buffer (maxBlocks=5) and stalls record file production after ~20s. Mirror importer falls behind, pinger can never confirm transactions, and the one-shot deploy times out.

Fix profile-manager to explicitly set FILE_ONLY when no block nodes are in the deployment state.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
… node

FILE_ONLY is not a valid BlockStreamWriterMode enum value in CN v0.74; the correct value is FILE. Using FILE_ONLY caused CN to fail to become ACTIVE across all E2E tests.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Jeffrey Tang <jeffrey@swirldslabs.com>
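Taken together, the two commits above amount to a small decision rule: override the writer mode to FILE only when the deployment has no block nodes, and otherwise leave the consensus node default alone. A minimal sketch of that rule follows; the function name `writerModeOverride` and the block-node-count parameter are hypothetical, and how profile-manager actually applies the value is not shown here.

```typescript
// Illustrative sketch; writerModeOverride is a hypothetical helper,
// not the actual profile-manager code.

// Values mentioned in the commit messages for CN v0.74. FILE_ONLY is
// deliberately absent: it is not a valid BlockStreamWriterMode value.
type BlockStreamWriterMode = 'FILE' | 'FILE_AND_GRPC';

// Return an explicit override only when there are no block nodes.
// undefined means "leave the consensus node default untouched".
function writerModeOverride(blockNodeCount: number): BlockStreamWriterMode | undefined {
  return blockNodeCount === 0 ? 'FILE' : undefined;
}
```

Typing the mode as a string-literal union means a value such as `'FILE_ONLY'` would be rejected at compile time in this sketch, which is the class of mistake the follow-up commit corrected.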
jan-milenkov previously approved these changes Jun 19, 2026
jeromy-cannon previously approved these changes Jun 19, 2026
aadb62d
jan-milenkov approved these changes Jun 21, 2026
This was referenced Jun 22, 2026
swirlds-automation added a commit that referenced this pull request Jun 23, 2026
## [0.79.0](v0.78.0...v0.79.0) (2026-06-23)

### Features

* disable minio for CN >= 0.74.0 ([#4511](#4511)) ([e8a8c90](e8a8c90))

### Bug Fixes

* add GitHubApiClient with GITHUB_TOKEN auth and retry on 403/429 ([#4712](#4712)) ([1f63fc5](1f63fc5))
* delay one-shot mirror pinger deployment ([#4762](#4762)) ([5d787e3](5d787e3))
* generate error docs in solo and upload as release artifact ([#4750](#4750)) ([526ee3a](526ee3a))
* **lock:** treat suspended holders as lost so re-runs can reclaim ([#4663](#4663)) ([68289cf](68289cf))
* lower block memory footprint & fix migration from CN. 0.73 to CN 0.74 ([#4678](#4678)) ([de23795](de23795))
* mount small-memory patches directory from SOLO_CACHE staging ([#4756](#4756)) ([e02a97f](e02a97f))
* rework account creation idempotency guard ([#4728](#4728)) ([1439ef3](1439ef3))
* tolerate Helm OCI status output ([#4652](#4652)) ([733d052](733d052))
* update TSS wraps artifacts path to data/keys subdirectory ([#4662](#4662)) ([41e9377](41e9377))
🎉 This PR is included in version 0.79.0 🎉

The release is available on GitHub release.

Your semantic-release bot 📦🚀
Description
This pull request changes the following:
- Add `GitHubApiClient` (`src/core/github-api-client.ts`) with `GITHUB_TOKEN` auth and automatic retry on 403/429 responses, fixing GitHub API rate-limit failures in CI dependency managers
- Add `setServiceEndpoints(ClusterIP)` to the existing `NodeUpdateTransaction` in `setGrpcWebEndpoint`, so the mirror node address book carries a routable IP instead of the bootstrap FQDN. hiero-sdk-go v2.80.0 introduced eager gRPC dialing at `ClientForNetwork` startup; on Windows Kind/WSL2 the FQDN TCP dial hangs ~13 min (kernel retransmit timeout), blocking pinger readiness. After the NodeUpdate, the importer writes the ClusterIP to file 0.0.102 and pinger connects immediately on its next restart.
- Increase the pinger readiness wait (`MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS`: 450 → 900 × 2 s, env-var overridable) to accommodate image-load overhead on Windows runners introduced by `ebe4534e1`
- Add dedicated pinger constants (`MIRROR_NODE_PINGER_PODS_READY_MAX_ATTEMPTS`, `MIRROR_NODE_PINGER_PODS_READY_DELAY`) so pinger wait can be tuned independently of other pod readiness checks
- Increase the `NodesStarted` event wait timeout to 30 minutes (`NODES_STARTED_EVENT_TIMEOUT_MINUTES`, env-var overridable) to fix relay timeout in one-shot deploy
- Set the `MirrorNodeDeployed` event wait timeout to 10 minutes (`MIRROR_NODE_DEPLOYED_EVENT_TIMEOUT_MINUTES`, env-var overridable)
- Increase `node-add-local` and `separate-node-add` E2E test timeouts to fix intermittent CI timeout failures
- `one-shot-local-build` example: skip `helm dependency build` entirely when chart tarballs are already cached, avoiding Docker Hub unauthenticated 429 rate-limit errors on cache-hit runs
- Add `docker.io/bitnami/postgresql:latest` to the solo image cache target list to prevent `ImagePullBackOff` on Windows

Related Issues
Pull request (PR) checklist
- `package.json` changes have been explained to and approved by a repository manager

Testing
The following manual testing was done:
- Unit tests for `GitHubApiClient` (9 tests: auth header injection, 403/429 retry with backoff, 404 passthrough, non-auth requests)
- `task build` passes cleanly (0 errors)

The following was not tested locally (relies on CI):

- `one-shot-local-build` helm cache skip on cache-miss path (first run still contacts Docker Hub)