
fix(client): close gateway-route race with data-plane warmup + verified-404-retry #348

Draft
websterbei wants to merge 1 commit into main from cursor/trainer-route-warmup-c137

Conversation

@websterbei
Contributor

Context

Counter-proposal to #344. That PR addresses the same crash (tinker.NotFoundError on a freshly-RUNNING trainer) by wrapping every training call in a 6-attempt / ~90 s exponential-backoff retry on any 404.

This PR fixes the underlying race instead of papering over its symptom, so that:

  1. Most jobs never see a 404 in the first place.
  2. When a 404 does happen, we can tell whether it is the gateway-routing race or a genuinely-deleted job, and act accordingly.

Root cause (recap)

The orchestrator talks to trainers via the Fireworks API gateway, which proxies /training/v1/rlorTrainerJobs/{accountId}/{jobId}/* to a pod by looking up (accountId, jobId) -> route in a DynamoDB-backed table.

TrainerJobManager.wait_for_existing() returns "ready" as soon as:

  • the control plane reports JOB_STATE_RUNNING, and
  • a single GET /api/v1/healthz through the gateway returns 200.

That is not enough. The DynamoDB route entry visible to the generic request path can still be stale across gateway replicas / connection-pool entries for a few seconds after RUNNING. The first real call then 404s; tinker correctly does not retry 404 (treats it as permanent in pure tinker semantics); the orchestrator crashes — even though the trainer pod is healthy and would have served the request seconds later.

See the discussion on #344 for the full trace through tinker/lib/api_future_impl.py, internal_client_holder.py, and fireworks/training/sdk/trainer.py:_get_trainer_gateway_url.

Fix

Two narrowly scoped changes in training/utils/client.py:

1. Data-plane warmup at connect time

_wait_for_data_plane_ready() runs after wait_for_existing returns. It issues get_info() calls (which traverse the same gateway → trainer route as forward / forward_backward) until it sees _WARMUP_REQUIRED_SUCCESSES = 3 consecutive successes. 404s during warmup are expected and retried with bounded backoff; non-404 errors short-circuit the warmup so we don't silently mask a real trainer failure. Hard timeout of 120 s.

This eliminates the "freshly-RUNNING window" deterministically. Most jobs will never observe a 404 at the orchestrator after this lands.
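For concreteness, a minimal sketch of the warmup loop's shape. This is illustrative, not the exact code in training/utils/client.py: the function signature, backoff constant, and timeout exception are assumptions; only `_WARMUP_REQUIRED_SUCCESSES = 3`, the 120 s cap, the use of get_info() as the probe, and the "non-404s propagate" behavior come from the description above.

```python
import time

import tinker  # assumed: tinker.NotFoundError is the 404 exception

_WARMUP_REQUIRED_SUCCESSES = 3  # consecutive get_info() successes required
_WARMUP_TIMEOUT_S = 120         # hard cap on the warmup window
_WARMUP_BACKOFF_S = 1.0         # illustrative backoff between 404'd probes


def _wait_for_data_plane_ready(training_client) -> None:
    """Probe the data plane until the gateway route is stable.

    get_info() traverses the same gateway -> trainer route as
    forward / forward_backward, so N consecutive successes imply the
    route entry is visible to the generic request path.
    """
    deadline = time.monotonic() + _WARMUP_TIMEOUT_S
    successes = 0
    while successes < _WARMUP_REQUIRED_SUCCESSES:
        if time.monotonic() >= deadline:
            raise TimeoutError("data plane not ready within warmup timeout")
        try:
            training_client.get_info()
        except tinker.NotFoundError:
            # Expected during the freshly-RUNNING window: some gateway
            # replicas may still hold a stale route. Reset and retry.
            successes = 0
            time.sleep(_WARMUP_BACKOFF_S)
            continue
        # Non-404 errors are deliberately not caught: they indicate a real
        # trainer failure that warmup must not silently mask.
        successes += 1
```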

2. Verified retry on transient 404 at request time

_retry_on_transient_not_found() wraps the four training calls (forward, forward_backward, forward_backward_custom, optim_step). On a 404 it queries the control plane and acts on the result (see the sketch after this list):

  • state == JOB_STATE_RUNNING → routing race, retry with bounded backoff (_NOT_FOUND_MAX_RETRIES = 4, ~30 s worst case).
  • anything else (deleted / failed / paused) → re-raise the 404 immediately so the orchestrator can fail fast and resume from DCP, instead of burning the full retry budget.
  • control-plane probe itself raises → also re-raise the 404 (don't mask compounded failures).
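A hedged sketch of the wrapper's shape, under the same caveat as above: `_NOT_FOUND_MAX_RETRIES = 4` and the JOB_STATE_RUNNING check come from the description; `get_job_state` is a hypothetical control-plane probe and the backoff schedule is illustrative.

```python
import functools
import time

import tinker  # assumed: tinker.NotFoundError is the 404 exception

_NOT_FOUND_MAX_RETRIES = 4   # ~30 s worst case with the backoff below
_NOT_FOUND_BACKOFF_S = 2.0   # illustrative base backoff


def _retry_on_transient_not_found(call, get_job_state):
    """Retry `call` on 404 only while the control plane still reports
    JOB_STATE_RUNNING, i.e. only for the gateway-routing race."""

    @functools.wraps(call)
    def wrapper(*args, **kwargs):
        for attempt in range(_NOT_FOUND_MAX_RETRIES + 1):
            try:
                return call(*args, **kwargs)
            except tinker.NotFoundError as not_found:
                if attempt == _NOT_FOUND_MAX_RETRIES:
                    raise  # retry budget exhausted
                try:
                    state = get_job_state()
                except Exception:
                    # Control-plane probe failed: re-raise the original
                    # 404 rather than mask compounded failures.
                    raise not_found
                if state != "JOB_STATE_RUNNING":
                    # Deleted / failed / paused: fail fast so the
                    # orchestrator can resume from DCP.
                    raise not_found
                # Still RUNNING: routing race. Bounded backoff, then retry.
                time.sleep(_NOT_FOUND_BACKOFF_S * (attempt + 1))

    return wrapper
```

Non-404 exceptions fall straight through the wrapper, so only the routing race is ever retried.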

Comparison with #344

| Scenario | #344 | This PR |
| --- | --- | --- |
| 404 on first call after RUNNING | retried (still crashes loop) | prevented at connect time |
| 404 because job was deleted | ~90 s of pointless backoff | fail-fast (immediate) |
| 404 mid-training, job still RUNNING | retried (~90 s budget) | retried (~30 s budget; warmup already ruled out cold-start) |
| Distinguishes routing race vs. real not-found | no | yes (per-attempt control-plane probe) |
| Net training-time impact (happy path) | none | one-off ~3 s warmup at connect |
| Layer of fix | symptom (request) | underlying readiness contract (connect) |

Note on the "real" fix

This is still a client-side mitigation. The fully correct fix lives in the gateway:

  • For jobs whose control-plane state is RUNNING, retry the DynamoDB route lookup internally before returning 404, or
  • return a retriable status (e.g. 503 with queue_state: route_not_ready) so tinker's existing 408/5xx retry loop handles it transparently, or
  • disambiguate "route stale" (transient) from "trainer truly gone" (terminal) on the gateway side.

Once any of those land upstream, the warmup + verified retry in this PR can be deleted.

Tests

12 new unit tests in training/tests/unit/test_client.py:

Warmup

  • test_warmup_returns_after_required_consecutive_successes
  • test_warmup_recovers_after_transient_404s
  • test_warmup_raises_on_persistent_404_after_timeout
  • test_warmup_skips_on_non_404_error

Verified retry

  • test_retry_succeeds_immediately_when_no_404
  • test_retry_recovers_after_transient_404_when_still_running
  • test_retry_fails_fast_when_job_no_longer_running
  • test_retry_fails_fast_when_control_plane_probe_errors
  • test_retry_exhausts_when_404_persists
  • test_retry_propagates_non_404_immediately

End-to-end through forward_backward

  • test_forward_backward_retries_only_when_running
  • test_forward_backward_does_not_retry_when_job_deleted
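For illustration, one of the fail-fast tests might look roughly like this. The mock wiring, module layout, and exception constructor are assumptions (paired with the wrapper sketch above), not the actual test code:

```python
from unittest import mock

import pytest
import tinker

from training.utils import client as client_mod


def test_retry_fails_fast_when_job_no_longer_running():
    # The training call always 404s; the control plane says the job is gone.
    call = mock.Mock(side_effect=tinker.NotFoundError("no route for job"))
    get_job_state = mock.Mock(return_value="JOB_STATE_FAILED")

    wrapped = client_mod._retry_on_transient_not_found(call, get_job_state)

    # The 404 must propagate immediately: no retry budget is burned.
    with pytest.raises(tinker.NotFoundError):
        wrapped()
    assert call.call_count == 1
```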

All 16 tests in training/tests/unit/test_client.py pass. The remaining unit-suite failures in this environment are pre-existing (math_verify not installed) and unrelated to this change.


fix(client): close gateway-route race with data-plane warmup + verified-404-retry

Counter-proposal to #344: instead of a 90s blanket retry on every 404, address
the actual race in two narrowly-scoped places.

Root cause (see PR #344 thread for the full trace):
  1. Orchestrator talks to trainers via the Fireworks API gateway, which
     looks up (account_id, job_id) -> pod route in a DynamoDB-backed table.
  2. TrainerJobManager.wait_for_existing returns 'ready' as soon as the
     control plane reports JOB_STATE_RUNNING and a single /api/v1/healthz
     call through the gateway succeeds. The gateway's generic request-path
     route can still be stale at that point (read-after-write inconsistency
     across gateway replicas / connection-pool entries).
  3. The first forward_backward then 404s. Tinker correctly does not retry
     404 (treats it as permanent), and the orchestrator crashes.

Proper fix in this PR:

* _wait_for_data_plane_ready (called from _use_endpoint): after the SDK
  hands us the endpoint, send N consecutive get_info() probes through the
  same gateway path forward_backward will use. Only return after we see
  the route is globally stable. This eliminates the 'freshly-RUNNING'
  window deterministically rather than papering over its symptom.

* _retry_on_transient_not_found (used by forward / forward_backward /
  forward_backward_custom / optim_step): if a 404 still surfaces, query
  the control plane. Retry only if state is still JOB_STATE_RUNNING --
  otherwise re-raise immediately so a truly-deleted/failed/paused job
  fails fast instead of burning the full retry budget.

Compared to PR #344:
  - Most jobs never see a 404 at all (warmup eats it at connect time).
  - True not-founds now propagate immediately (no ~90s wait).
  - Tighter retry budget when a 404 does happen mid-training (4 retries,
    ~30s worst-case) since warmup already ruled out the cold-start window.
  - Per-attempt control-plane probe makes failure modes diagnosable
    (logs distinguish 'gateway routing race' from 'job genuinely gone').

Tests:
  - 12 new unit tests cover both the warmup loop (success after N probes,
    recovery from transient 404s, timeout on persistent 404, non-404
    surfacing) and the verified retry (success, recovery, fast-fail when
    job not RUNNING, fast-fail when control-plane probe errors, exhaustion,
    non-404 propagation, end-to-end through forward_backward).
  - All 16 tests in training/tests/unit/test_client.py pass.

Note: this is still a client-side mitigation. The fully proper fix lives
in the gateway (retry the DynamoDB lookup internally for jobs whose
control-plane state is RUNNING, or return a retriable 503 instead of 404
during the warm-up window). The cookbook side should be removed once that
ships.

Co-authored-by: Yufei (Benny) Chen <benjibc@users.noreply.github.com>
