
fix(client): close gateway-route race with data-plane warmup + verified-404-retry #348

Draft
websterbei wants to merge 1 commit into main from cursor/trainer-route-warmup-c137

Conversation

@websterbei
Contributor

Context

Counter-proposal to #344. That PR addresses the same crash (tinker.NotFoundError on a freshly-RUNNING trainer) by wrapping every training call in a 6-attempt / ~90 s exponential-backoff retry on any 404.

This PR fixes the underlying race instead of papering over its symptom, so that:

  1. Most jobs never see a 404 in the first place.
  2. When a 404 does happen, we can tell whether it is the gateway-routing race or a genuinely-deleted job, and act accordingly.

Root cause (recap)

The orchestrator talks to trainers via the Fireworks API gateway, which proxies /training/v1/rlorTrainerJobs/{accountId}/{jobId}/* to a pod by looking up (accountId, jobId) -> route in a DynamoDB-backed table.

TrainerJobManager.wait_for_existing() returns "ready" as soon as:

  • the control plane reports JOB_STATE_RUNNING, and
  • a single GET /api/v1/healthz through the gateway returns 200.

That is not enough. The DynamoDB route entry visible to the generic request path can still be stale across gateway replicas / connection-pool entries for a few seconds after RUNNING. The first real call then 404s; tinker correctly does not retry 404 (treats it as permanent in pure tinker semantics); the orchestrator crashes — even though the trainer pod is healthy and would have served the request seconds later.

See the discussion on #344 for the full trace through tinker/lib/api_future_impl.py, internal_client_holder.py, and fireworks/training/sdk/trainer.py:_get_trainer_gateway_url.

Fix

Two narrowly scoped changes in training/utils/client.py:

1. Data-plane warmup at connect time

_wait_for_data_plane_ready() runs after wait_for_existing returns. It issues get_info() calls (which traverse the same gateway → trainer route as forward / forward_backward) until it sees _WARMUP_REQUIRED_SUCCESSES = 3 consecutive successes. 404s during warmup are expected and retried with bounded backoff; non-404 errors short-circuit the warmup so we don't silently mask a real trainer failure. Hard timeout of 120 s.

This eliminates the "freshly-RUNNING window" deterministically. Most jobs will never observe a 404 at the orchestrator after this lands.
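For concreteness, a minimal sketch of the warmup loop's shape. This is illustrative, not the exact code in training/utils/client.py: the function signature, backoff constant, and timeout exception are assumptions; only `_WARMUP_REQUIRED_SUCCESSES = 3`, the 120 s cap, the use of get_info() as the probe, and the "non-404s propagate" behavior come from the description above.

```python
import time

import tinker  # assumed: tinker.NotFoundError is the 404 exception

_WARMUP_REQUIRED_SUCCESSES = 3  # consecutive get_info() successes required
_WARMUP_TIMEOUT_S = 120         # hard cap on the warmup window
_WARMUP_BACKOFF_S = 1.0         # illustrative backoff between 404'd probes


def _wait_for_data_plane_ready(training_client) -> None:
    """Probe the data plane until the gateway route is stable.

    get_info() traverses the same gateway -> trainer route as
    forward / forward_backward, so N consecutive successes imply the
    route entry is visible to the generic request path.
    """
    deadline = time.monotonic() + _WARMUP_TIMEOUT_S
    successes = 0
    while successes < _WARMUP_REQUIRED_SUCCESSES:
        if time.monotonic() >= deadline:
            raise TimeoutError("data plane not ready within warmup timeout")
        try:
            training_client.get_info()
        except tinker.NotFoundError:
            # Expected during the freshly-RUNNING window: some gateway
            # replicas may still hold a stale route. Reset and retry.
            successes = 0
            time.sleep(_WARMUP_BACKOFF_S)
            continue
        # Non-404 errors are deliberately not caught: they indicate a real
        # trainer failure that warmup must not silently mask.
        successes += 1
```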

2. Verified retry on transient 404 at request time

_retry_on_transient_not_found() wraps the four training calls (forward, forward_backward, forward_backward_custom, optim_step). On a 404 it queries the control plane and acts on the result (see the sketch after this list):

  • state == JOB_STATE_RUNNING → routing race, retry with bounded backoff (_NOT_FOUND_MAX_RETRIES = 4, ~30 s worst case).
  • anything else (deleted / failed / paused) → re-raise the 404 immediately so the orchestrator can fail fast and resume from DCP, instead of burning the full retry budget.
  • control-plane probe itself raises → also re-raise the 404 (don't mask compounded failures).
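A hedged sketch of the wrapper's shape, under the same caveat as above: `_NOT_FOUND_MAX_RETRIES = 4` and the JOB_STATE_RUNNING check come from the description; `get_job_state` is a hypothetical control-plane probe and the backoff schedule is illustrative.

```python
import functools
import time

import tinker  # assumed: tinker.NotFoundError is the 404 exception

_NOT_FOUND_MAX_RETRIES = 4   # ~30 s worst case with the backoff below
_NOT_FOUND_BACKOFF_S = 2.0   # illustrative base backoff


def _retry_on_transient_not_found(call, get_job_state):
    """Retry `call` on 404 only while the control plane still reports
    JOB_STATE_RUNNING, i.e. only for the gateway-routing race."""

    @functools.wraps(call)
    def wrapper(*args, **kwargs):
        for attempt in range(_NOT_FOUND_MAX_RETRIES + 1):
            try:
                return call(*args, **kwargs)
            except tinker.NotFoundError as not_found:
                if attempt == _NOT_FOUND_MAX_RETRIES:
                    raise  # retry budget exhausted
                try:
                    state = get_job_state()
                except Exception:
                    # Control-plane probe failed: re-raise the original
                    # 404 rather than mask compounded failures.
                    raise not_found
                if state != "JOB_STATE_RUNNING":
                    # Deleted / failed / paused: fail fast so the
                    # orchestrator can resume from DCP.
                    raise not_found
                # Still RUNNING: routing race. Bounded backoff, then retry.
                time.sleep(_NOT_FOUND_BACKOFF_S * (attempt + 1))

    return wrapper
```

Non-404 exceptions fall straight through the wrapper, so only the routing race is ever retried.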

Comparison with #344

| Scenario | #344 | This PR |
| --- | --- | --- |
| 404 on first call after RUNNING | retried (still crashes loop) | prevented at connect time |
| 404 because job was deleted | ~90 s of pointless backoff | fail-fast (immediate) |
| 404 mid-training, job still RUNNING | retried (~90 s budget) | retried (~30 s budget; warmup already ruled out cold-start) |
| Distinguishes routing race vs. real not-found | no | yes (per-attempt control-plane probe) |
| Net training-time impact (happy path) | none | one-off ~3 s warmup at connect |
| Layer of fix | symptom (request) | underlying readiness contract (connect) |

Note on the "real" fix

This is still a client-side mitigation. The fully correct fix lives in the gateway:

  • For jobs whose control-plane state is RUNNING, retry the DynamoDB route lookup internally before returning 404, or
  • return a retriable status (e.g. 503 with queue_state: route_not_ready) so tinker's existing 408/5xx retry loop handles it transparently, or
  • disambiguate "route stale" (transient) from "trainer truly gone" (terminal) on the gateway side.

Once any of those land upstream, the warmup + verified retry in this PR can be deleted.

Tests

12 new unit tests in training/tests/unit/test_client.py:

Warmup

  • test_warmup_returns_after_required_consecutive_successes
  • test_warmup_recovers_after_transient_404s
  • test_warmup_raises_on_persistent_404_after_timeout
  • test_warmup_skips_on_non_404_error

Verified retry

  • test_retry_succeeds_immediately_when_no_404
  • test_retry_recovers_after_transient_404_when_still_running
  • test_retry_fails_fast_when_job_no_longer_running
  • test_retry_fails_fast_when_control_plane_probe_errors
  • test_retry_exhausts_when_404_persists
  • test_retry_propagates_non_404_immediately

End-to-end through forward_backward

  • test_forward_backward_retries_only_when_running
  • test_forward_backward_does_not_retry_when_job_deleted
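For illustration, one of the fail-fast tests might look roughly like this. The mock wiring, module layout, and exception constructor are assumptions (paired with the wrapper sketch above), not the actual test code:

```python
from unittest import mock

import pytest
import tinker

from training.utils import client as client_mod


def test_retry_fails_fast_when_job_no_longer_running():
    # The training call always 404s; the control plane says the job is gone.
    call = mock.Mock(side_effect=tinker.NotFoundError("no route for job"))
    get_job_state = mock.Mock(return_value="JOB_STATE_FAILED")

    wrapped = client_mod._retry_on_transient_not_found(call, get_job_state)

    # The 404 must propagate immediately: no retry budget is burned.
    with pytest.raises(tinker.NotFoundError):
        wrapped()
    assert call.call_count == 1
```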

All 16 tests in training/tests/unit/test_client.py pass. The remaining unit-suite failures in this environment are pre-existing (math_verify not installed) and unrelated to this change.


fix(client): close gateway-route race with data-plane warmup + verified-404-retry

Counter-proposal to #344: instead of a 90s blanket retry on every 404, address
the actual race in two narrowly-scoped places.

Root cause (see PR #344 thread for the full trace):
  1. Orchestrator talks to trainers via the Fireworks API gateway, which
     looks up (account_id, job_id) -> pod route in a DynamoDB-backed table.
  2. TrainerJobManager.wait_for_existing returns 'ready' as soon as the
     control plane reports JOB_STATE_RUNNING and a single /api/v1/healthz
     call through the gateway succeeds. The gateway's generic request-path
     route can still be stale at that point (read-after-write inconsistency
     across gateway replicas / connection-pool entries).
  3. The first forward_backward then 404s. Tinker correctly does not retry
     404 (treats it as permanent), and the orchestrator crashes.

Proper fix in this PR:

* _wait_for_data_plane_ready (called from _use_endpoint): after the SDK
  hands us the endpoint, send N consecutive get_info() probes through the
  same gateway path forward_backward will use. Only return after we see
  the route is globally stable. This eliminates the 'freshly-RUNNING'
  window deterministically rather than papering over its symptom.

* _retry_on_transient_not_found (used by forward / forward_backward /
  forward_backward_custom / optim_step): if a 404 still surfaces, query
  the control plane. Retry only if state is still JOB_STATE_RUNNING --
  otherwise re-raise immediately so a truly-deleted/failed/paused job
  fails fast instead of burning the full retry budget.

Compared to PR #344:
  - Most jobs never see a 404 at all (warmup eats it at connect time).
  - True not-founds now propagate immediately (no ~90s wait).
  - Tighter retry budget when a 404 does happen mid-training (4 retries,
    ~30s worst-case) since warmup already ruled out the cold-start window.
  - Per-attempt control-plane probe makes failure modes diagnosable
    (logs distinguish 'gateway routing race' from 'job genuinely gone').

Tests:
  - 12 new unit tests cover both the warmup loop (success after N probes,
    recovery from transient 404s, timeout on persistent 404, non-404
    surfacing) and the verified retry (success, recovery, fast-fail when
    job not RUNNING, fast-fail when control-plane probe errors, exhaustion,
    non-404 propagation, end-to-end through forward_backward).
  - All 16 tests in training/tests/unit/test_client.py pass.

Note: this is still a client-side mitigation. The fully proper fix lives
in the gateway (retry the DynamoDB lookup internally for jobs whose
control-plane state is RUNNING, or return a retriable 503 instead of 404
during the warm-up window). The cookbook side should be removed once that
ships.

Co-authored-by: Yufei (Benny) Chen <benjibc@users.noreply.github.com>
