fix(infra): strip pinned /versions/ suffix from shape IDs and enrich 400 error messages #392

Draft
andrefoo wants to merge 1 commit into main from
cursor/fix-training-shape-validation-1cdf


Problem

Clients are hitting HTTP 400 "no validated training shape exists" errors in two scenarios:

  1. Pinned /versions/<id> suffixes on explicit shape IDs. A pinned version can reference a stale or unvalidated snapshot. The platform auto-selects the latest validated version when given a bare shape path, but clients are not aware of this and pin old versions that then fail validation.

  2. Opaque server errors. The 400 response says "no validated training shape exists" but gives no actionable guidance about why the request failed or how to fix it. This is especially confusing for custom (non-Fireworks) base models and for LoRA DPO, where the shape/mode interaction is subtle.

Example client error (LoRA DPO with custom model):

ERROR: RLOR job creation failed (HTTP 400)
  Cause: no validated training shape exists for
    training_shape=accounts/fireworks/trainingShapes/qwen3p5-35b-a3b-256k-lora/versions/bbmrqbzh
    base_model=accounts/wix/models/c2s-merged-v4
    trainer_mode=FORWARD_ONLY

Changes

training/utils/infra.py

  • _strip_version_suffix(): Automatically strips /versions/<id> from explicit training_shape_id and ref_training_shape_id with a WARNING log, letting the platform auto-select the latest validated version. Applied in both _resolve_policy_shape and _resolve_reference_shape.

  • _shape_error_hint(): Parses the server's "no validated training shape" error and returns actionable hints appended to the RuntimeError:

    • Pinned version detected → tells client to use bare shape path
    • FORWARD_ONLY + lora_rank > 0 mismatch → suggests the shared-session reference (policy.create_base_reference())
    • Custom (non-Fireworks) base model → explains shape registration requirements
    • Generic fallback → suggests auto-selection
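
The two helpers described above could look roughly like the following. This is a minimal sketch and not the code in training/utils/infra.py. The names `strip_version_suffix` and `shape_error_hint`, the keyword parameters, the hint wording, the "first matching branch wins" ordering, and the `accounts/fireworks/` prefix check for custom models are all assumptions made for illustration.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed pattern: a trailing "/versions/<id>" segment on a shape path.
_VERSION_SUFFIX = re.compile(r"/versions/[^/]+/?$")


def strip_version_suffix(shape_id: str) -> str:
    """Return the bare shape path, logging a WARNING if a pinned version was dropped."""
    bare = _VERSION_SUFFIX.sub("", shape_id)
    if bare != shape_id:
        logger.warning(
            "Ignoring pinned version in %r; using %r so the platform can "
            "auto-select the latest validated version.",
            shape_id,
            bare,
        )
    return bare


def shape_error_hint(
    error_text: str,
    *,
    shape_id: Optional[str] = None,
    base_model: Optional[str] = None,
    trainer_mode: Optional[str] = None,
    lora_rank: int = 0,
) -> Optional[str]:
    """Return a hint for a "no validated training shape" error, else None."""
    if "no validated training shape" not in error_text:
        return None  # unrelated error: nothing to add
    if shape_id and _VERSION_SUFFIX.search(shape_id):
        return f"Use the bare shape path {_VERSION_SUFFIX.sub('', shape_id)!r} instead of a pinned version."
    if trainer_mode == "FORWARD_ONLY" and lora_rank > 0:
        return "For LoRA, consider the shared-session reference: policy.create_base_reference()."
    # Assumption: anything outside accounts/fireworks/ counts as a custom base model.
    if base_model and not base_model.startswith("accounts/fireworks/"):
        return "Custom base models need a training shape registered and validated for them."
    return "Omit the explicit shape ID and let the platform auto-select one."
```

With the shape ID from the example error, `strip_version_suffix` would return `accounts/fireworks/trainingShapes/qwen3p5-35b-a3b-256k-lora`. Returning a single hint keeps the appended message short; collecting every matching hint would be an equally reasonable choice.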

Tests

9 new tests across test_infra_setup.py and test_shape_override_paths.py:

  • Version stripping for policy shape, ref shape, and bare (no-op) path
  • Error hint for pinned version, FORWARD_ONLY/LoRA mismatch, custom model, generic fallback, unrelated errors, and integration with request_trainer_job
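
As an illustration of the version-stripping cases, a test could assert both the returned path and the WARNING log. The sketch below is not taken from test_infra_setup.py or test_shape_override_paths.py; it uses a small stand-in for the helper (the name `strip_version_suffix` and the logger name `infra_sketch` are hypothetical) so that it runs on its own.

```python
import logging
import re
import unittest

logger = logging.getLogger("infra_sketch")  # hypothetical logger name


def strip_version_suffix(shape_id: str) -> str:
    """Stand-in for the real helper: drop a trailing /versions/<id> and warn."""
    bare = re.sub(r"/versions/[^/]+/?$", "", shape_id)
    if bare != shape_id:
        logger.warning("Ignoring pinned version in %r", shape_id)
    return bare


class StripVersionSuffixTest(unittest.TestCase):
    BARE = "accounts/fireworks/trainingShapes/qwen3p5-35b-a3b-256k-lora"

    def test_pinned_version_is_stripped_with_warning(self):
        # assertLogs fails the test if no WARNING is emitted inside the block.
        with self.assertLogs("infra_sketch", level="WARNING") as captured:
            result = strip_version_suffix(self.BARE + "/versions/bbmrqbzh")
        self.assertEqual(result, self.BARE)
        self.assertEqual(len(captured.records), 1)

    def test_bare_path_is_a_no_op(self):
        self.assertEqual(strip_version_suffix(self.BARE), self.BARE)
```

Saved to a file, this can be run with `python -m unittest <file>`.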

All 505 unit tests pass.


…400 error messages

Clients hitting HTTP 400 'no validated training shape exists' errors when:
1. Using explicit shape IDs with pinned /versions/<id> suffixes, which
   can reference stale or unvalidated snapshots.
2. Using shapes that don't match their base model, with no guidance on
   what went wrong.

Changes:
- _strip_version_suffix: automatically strips /versions/<id> from
  explicit training_shape_id and ref_training_shape_id with a warning,
  letting the platform auto-select the latest validated version.
- _shape_error_hint: parses the server's 'no validated training shape'
  error and produces actionable hints:
  * Pinned version detected -> tell client to use bare shape path
  * FORWARD_ONLY + lora_rank>0 mismatch -> suggest shared-session ref
  * Custom (non-Fireworks) base model -> explain shape registration
  * Generic fallback -> suggest auto-selection

Tests: 9 new tests covering version stripping (policy, ref, bare path)
and all error hint branches.

Motivated by client errors with qwen3-coder-30b-a3b / qwen3p5-35b shapes.

Co-authored-by: Andre Foo <andrefoo@users.noreply.github.com>