fix(client): close gateway-route race with data-plane warmup + verified-404-retry #348
Draft
websterbei wants to merge 1 commit into main from
Conversation
…ed-404-retry

Counter-proposal to #344: instead of a 90s blanket retry on every 404, address the actual race in two narrowly-scoped places.

Root cause (see PR #344 thread for the full trace):

1. Orchestrator talks to trainers via the Fireworks API gateway, which looks up (account_id, job_id) -> pod route in a DynamoDB-backed table.
2. TrainerJobManager.wait_for_existing returns 'ready' as soon as the control plane reports JOB_STATE_RUNNING and a single /api/v1/healthz call through the gateway succeeds. The gateway's generic request-path route can still be stale at that point (read-after-write inconsistency across gateway replicas / connection-pool entries).
3. The first forward_backward then 404s. Tinker correctly does not retry 404 (treats it as permanent), and the orchestrator crashes.

Proper fix in this PR:

* _wait_for_data_plane_ready (called from _use_endpoint): after the SDK hands us the endpoint, send N consecutive get_info() probes through the same gateway path forward_backward will use. Only return after we see the route is globally stable. This eliminates the 'freshly-RUNNING' window deterministically rather than papering over its symptom.
* _retry_on_transient_not_found (used by forward / forward_backward / forward_backward_custom / optim_step): if a 404 still surfaces, query the control plane. Retry only if state is still JOB_STATE_RUNNING -- otherwise re-raise immediately so a truly-deleted/failed/paused job fails fast instead of burning the full retry budget.

Compared to PR #344:

- Most jobs never see a 404 at all (warmup eats it at connect time).
- True not-founds now propagate immediately (no ~90s wait).
- Tighter retry budget when a 404 does happen mid-training (4 retries, ~30s worst-case) since warmup already ruled out the cold-start window.
- Per-attempt control-plane probe makes failure modes diagnosable (logs distinguish 'gateway routing race' from 'job genuinely gone').

Tests:

- 12 new unit tests cover both the warmup loop (success after N probes, recovery from transient 404s, timeout on persistent 404, non-404 surfacing) and the verified retry (success, recovery, fast-fail when job not RUNNING, fast-fail when control-plane probe errors, exhaustion, non-404 propagation, end-to-end through forward_backward).
- All 16 tests in training/tests/unit/test_client.py pass.

Note: this is still a client-side mitigation. The fully proper fix lives in the gateway (retry the DynamoDB lookup internally for jobs whose control-plane state is RUNNING, or return a retriable 503 instead of 404 during the warm-up window). The cookbook side should be removed once that ships.

Co-authored-by: Yufei (Benny) Chen <benjibc@users.noreply.github.com>
Context
Counter-proposal to #344. That PR addresses the same crash (tinker.NotFoundError on a freshly-RUNNING trainer) by wrapping every training call in a 6-attempt / ~90 s exponential-backoff retry on any 404. This PR fixes the underlying race instead of papering over its symptom, so that most jobs never observe a 404 at all and a genuine not-found still fails fast.
Root cause (recap)
The orchestrator talks to trainers via the Fireworks API gateway, which proxies /training/v1/rlorTrainerJobs/{accountId}/{jobId}/* to a pod by looking up (accountId, jobId) -> route in a DynamoDB-backed table. TrainerJobManager.wait_for_existing() returns "ready" as soon as:

- the control plane reports JOB_STATE_RUNNING, and
- GET /api/v1/healthz through the gateway returns 200.

That is not enough. The DynamoDB route entry visible to the generic request path can still be stale across gateway replicas / connection-pool entries for a few seconds after RUNNING. The first real call then 404s; tinker correctly does not retry 404 (treats it as permanent in pure tinker semantics); the orchestrator crashes — even though the trainer pod is healthy and would have served the request seconds later.

See the discussion on #344 for the full trace through tinker/lib/api_future_impl.py, internal_client_holder.py, and fireworks/training/sdk/trainer.py:_get_trainer_gateway_url.
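To make the gap concrete, here is a minimal sketch of the readiness check described above. The helper names (control_plane_state, gateway_healthz_ok) are illustrative stand-ins, not the actual SDK surface:

```python
import time


def control_plane_state(account_id: str, job_id: str) -> str:
    """Illustrative stub: ask the control plane for the job state."""
    return "JOB_STATE_RUNNING"


def gateway_healthz_ok(account_id: str, job_id: str) -> bool:
    """Illustrative stub: one GET /api/v1/healthz through the gateway."""
    return True


def wait_for_existing(account_id: str, job_id: str) -> None:
    # 1. Wait for the control plane to report RUNNING.
    while control_plane_state(account_id, job_id) != "JOB_STATE_RUNNING":
        time.sleep(1.0)
    # 2. One healthz probe through the gateway.
    if not gateway_healthz_ok(account_id, job_id):
        raise RuntimeError("trainer unhealthy")
    # Declared "ready" here -- but only one gateway replica / pooled connection
    # has proven it knows the (accountId, jobId) -> pod route. Another replica
    # can still serve a stale table entry, so the first forward_backward that
    # lands there 404s.
```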
Fix

Two narrowly-scoped changes in training/utils/client.py:

1. Data-plane warmup at connect time
_wait_for_data_plane_ready() runs after wait_for_existing returns. It issues get_info() calls (which traverse the same gateway → trainer route as forward / forward_backward) until it sees _WARMUP_REQUIRED_SUCCESSES = 3 consecutive successes. 404s during warmup are expected and retried with bounded backoff; non-404 errors short-circuit the warmup so we don't silently mask a real trainer failure. Hard timeout of 120 s.

This eliminates the "freshly-RUNNING window" deterministically. Most jobs will never observe a 404 at the orchestrator after this lands.
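A minimal sketch of the warmup loop's shape, not the exact code in training/utils/client.py: NotFoundError stands in for the SDK's 404 exception, client.get_info() for the data-plane probe, and the backoff value is an assumption; the success count (3) and the 120 s timeout come from the description above.

```python
import time


class NotFoundError(Exception):
    """Stand-in for the SDK's 404 exception (e.g. tinker.NotFoundError)."""


_WARMUP_REQUIRED_SUCCESSES = 3   # consecutive probes that must succeed
_WARMUP_TIMEOUT_S = 120.0        # hard timeout for the whole warmup
_WARMUP_BACKOFF_S = 2.0          # illustrative bounded backoff between probes


def _wait_for_data_plane_ready(client) -> None:
    """Probe the data plane until the gateway route looks globally stable."""
    deadline = time.monotonic() + _WARMUP_TIMEOUT_S
    streak = 0
    while streak < _WARMUP_REQUIRED_SUCCESSES:
        if time.monotonic() > deadline:
            raise TimeoutError("data plane still returning 404 after warmup timeout")
        try:
            # Traverses the same gateway -> trainer route that
            # forward / forward_backward will use. Any non-404 error
            # propagates immediately so a real trainer failure is not masked.
            client.get_info()
        except NotFoundError:
            # Expected while the route is stale on some gateway replica:
            # reset the streak and back off.
            streak = 0
            time.sleep(_WARMUP_BACKOFF_S)
        else:
            streak += 1
```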
2. Verified retry on transient 404 at request time
_retry_on_transient_not_found() wraps the four training calls (forward, forward_backward, forward_backward_custom, optim_step). On a 404 it queries the control plane:

- state == JOB_STATE_RUNNING → routing race; retry with bounded backoff (_NOT_FOUND_MAX_RETRIES = 4, ~30 s worst case).
- any other state (or the job is gone) → re-raise immediately, so a truly-deleted/failed/paused job fails fast instead of burning the retry budget.
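A minimal sketch of the verified retry, under the same assumptions as the warmup sketch above; the get_control_plane_state callable is a hypothetical stand-in for the control-plane probe, and the backoff value is illustrative:

```python
import time


class NotFoundError(Exception):
    """Stand-in for the SDK's 404 exception (as in the warmup sketch)."""


_NOT_FOUND_MAX_RETRIES = 4
_NOT_FOUND_BACKOFF_S = 2.0  # illustrative bounded backoff


def _retry_on_transient_not_found(call, get_control_plane_state):
    """Run `call()`, retrying 404s only while the control plane says RUNNING.

    `call` is one of the four training calls; `get_control_plane_state()` is a
    hypothetical stand-in for the control-plane query.
    """
    for attempt in range(_NOT_FOUND_MAX_RETRIES + 1):
        try:
            return call()
        except NotFoundError:
            if attempt == _NOT_FOUND_MAX_RETRIES:
                raise  # retry budget exhausted
            # Verify against the control plane before retrying: only a job
            # that is still RUNNING can be a gateway routing race.
            if get_control_plane_state() != "JOB_STATE_RUNNING":
                raise  # job genuinely gone / failed / paused -> fail fast
            time.sleep(_NOT_FOUND_BACKOFF_S)
```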
Comparison with #344

- Most jobs never see a 404 at all (the warmup absorbs it at connect time).
- True not-founds propagate immediately instead of waiting out a ~90 s retry budget.
- When a 404 does happen mid-training, the retry budget is tighter (4 retries, ~30 s worst case), since the warmup already ruled out the cold-start window.
- The per-attempt control-plane probe keeps failure modes diagnosable: logs distinguish "gateway routing race" from "job genuinely gone".

Note on the "real" fix
This is still a client-side mitigation. The fully correct fix lives in the gateway:

- for jobs whose control-plane state is RUNNING, retry the DynamoDB route lookup internally before returning 404, or
- return a retriable response (e.g. a 503 tagged queue_state: route_not_ready) so tinker's existing 408/5xx retry loop handles it transparently.

Once any of those land upstream, the warmup + verified retry in this PR can be deleted.
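For illustration only, a sketch of what the first gateway-side option could look like; the handler shape, helper names (lookup_route, control_plane_says_running), and status-code choices are assumptions, not the gateway's actual code:

```python
import time

_ROUTE_LOOKUP_RETRIES = 3
_ROUTE_LOOKUP_BACKOFF_S = 0.5


def resolve_route_or_status(account_id: str, job_id: str,
                            lookup_route, control_plane_says_running):
    """Hypothetical gateway-side resolution of (account_id, job_id) -> pod route.

    If the DynamoDB lookup misses but the control plane still reports the job
    as RUNNING, retry the lookup internally (or return a retriable 503) instead
    of surfacing a hard 404 during the warm-up window.
    """
    for _ in range(_ROUTE_LOOKUP_RETRIES):
        route = lookup_route(account_id, job_id)   # DynamoDB-backed table
        if route is not None:
            return route, 200
        if not control_plane_says_running(account_id, job_id):
            return None, 404                       # job genuinely gone
        time.sleep(_ROUTE_LOOKUP_BACKOFF_S)        # read-after-write lag
    return None, 503                               # retriable: route not ready yet
```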
Tests
12 new unit tests in training/tests/unit/test_client.py:

Warmup

- test_warmup_returns_after_required_consecutive_successes
- test_warmup_recovers_after_transient_404s
- test_warmup_raises_on_persistent_404_after_timeout
- test_warmup_skips_on_non_404_error

Verified retry

- test_retry_succeeds_immediately_when_no_404
- test_retry_recovers_after_transient_404_when_still_running
- test_retry_fails_fast_when_job_no_longer_running
- test_retry_fails_fast_when_control_plane_probe_errors
- test_retry_exhausts_when_404_persists
- test_retry_propagates_non_404_immediately

End-to-end through forward_backward

- test_forward_backward_retries_only_when_running
- test_forward_backward_does_not_retry_when_job_deleted
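As an illustration of the shape of these tests (not the actual code in training/tests/unit/test_client.py), a hypothetical version of the fast-fail case, written against the _retry_on_transient_not_found sketch above with unittest.mock:

```python
import pytest
from unittest.mock import Mock


def test_retry_fails_fast_when_job_no_longer_running():
    # The training call 404s; the control plane says the job is deleted.
    call = Mock(side_effect=NotFoundError("stale or deleted route"))
    control_plane = Mock(return_value="JOB_STATE_DELETED")

    with pytest.raises(NotFoundError):
        _retry_on_transient_not_found(call, control_plane)

    # No retry budget is burned: one training call, one control-plane probe.
    assert call.call_count == 1
    assert control_plane.call_count == 1
```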
All 16 tests in training/tests/unit/test_client.py pass. The remaining unit-suite failures in this environment are pre-existing (math_verify not installed) and unrelated to this change.

Slack Thread