fix(infra): strip pinned /versions/ suffix from shape IDs and enrich 400 error messages #392
Draft
…400 error messages

Clients hitting HTTP 400 'no validated training shape exists' errors when:
1. Using explicit shape IDs with pinned /versions/<id> suffixes, which can reference stale or unvalidated snapshots.
2. Using shapes that don't match their base model, with no guidance on what went wrong.

Changes:
- _strip_version_suffix: automatically strips /versions/<id> from explicit training_shape_id and ref_training_shape_id with a warning, letting the platform auto-select the latest validated version.
- _shape_error_hint: parses the server's 'no validated training shape' error and produces actionable hints:
  * Pinned version detected -> tell client to use bare shape path
  * FORWARD_ONLY + lora_rank>0 mismatch -> suggest shared-session ref
  * Custom (non-Fireworks) base model -> explain shape registration
  * Generic fallback -> suggest auto-selection

Tests: 9 new tests covering version stripping (policy, ref, bare path) and all error hint branches.

Motivated by client errors with qwen3-coder-30b-a3b / qwen3p5-35b shapes.

Co-authored-by: Andre Foo <andrefoo@users.noreply.github.com>
Problem
Clients are hitting HTTP 400 "no validated training shape exists" errors in two scenarios:
1. Pinned /versions/<id> suffixes on explicit shape IDs: these can reference stale or unvalidated snapshots. The platform auto-selects the latest validated version when given a bare shape path, but clients don't know this and pin old versions that break.
2. Opaque server errors: when the 400 comes back, the error message from the server mentions "no validated training shape exists" but provides no actionable guidance about why or how to fix it. This is especially confusing for custom (non-Fireworks) base models and LoRA DPO, where the shape/mode interaction is subtle.
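To make the first scenario concrete, here is a minimal sketch of what stripping a pinned version could look like. It is not the code in this PR: the function name, the regex, and the example shape path are assumptions made for illustration only.

```python
import logging
import re
from typing import Optional

logger = logging.getLogger(__name__)

# Assumed pattern: a trailing "/versions/<id>" segment on a shape path.
_VERSION_SUFFIX = re.compile(r"/versions/[^/]+/?$")


def strip_version_suffix(shape_id: Optional[str]) -> Optional[str]:
    """Return the bare shape path, logging a warning if a pin was dropped."""
    if not shape_id:
        # None or empty string: nothing to strip.
        return shape_id
    bare = _VERSION_SUFFIX.sub("", shape_id)
    if bare != shape_id:
        logger.warning(
            "Stripping pinned version from shape ID %r; using %r instead",
            shape_id,
            bare,
        )
    return bare


# Hypothetical shape path, used only to show the behavior.
pinned = "accounts/example/trainingShapes/my-shape/versions/abc123"
print(strip_version_suffix(pinned))
```

A bare path passes through unchanged, so the helper is safe to apply to every explicit shape ID before it is sent to the platform.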
Example client error (LoRA DPO with custom model):
Changes
training/utils/infra.py
- _strip_version_suffix(): Automatically strips /versions/<id> from explicit training_shape_id and ref_training_shape_id with a WARNING log, letting the platform auto-select the latest validated version. Applied in both _resolve_policy_shape and _resolve_reference_shape.
- _shape_error_hint(): Parses the server's "no validated training shape" error and returns actionable hints appended to the RuntimeError:
  * FORWARD_ONLY + lora_rank > 0 mismatch → suggests the shared-session reference (policy.create_base_reference())
Tests
9 new tests across test_infra_setup.py and test_shape_override_paths.py:
request_trainer_job
All 505 unit tests pass.
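For reviewers who want a feel for the hint branches before opening the diff, the following is a rough sketch of the decision order given in the commit message (pinned version, FORWARD_ONLY with LoRA, custom base model, generic fallback). The signature, the is_custom_base_model flag, the branch order, and the hint wording are all assumptions here, not the actual _shape_error_hint implementation.

```python
from typing import Optional

# Substring the server is described as returning in the 400 body.
_MARKER = "no validated training shape"


def shape_error_hint(
    error_text: str,
    shape_id: Optional[str] = None,
    training_mode: str = "",
    lora_rank: int = 0,
    is_custom_base_model: bool = False,  # hypothetical flag for illustration
) -> Optional[str]:
    """Return an actionable hint for a shape error, or None if unrelated."""
    if _MARKER not in error_text.lower():
        return None
    if shape_id and "/versions/" in shape_id:
        return (
            "The shape ID pins a version; use the bare shape path so the "
            "platform can auto-select the latest validated version."
        )
    if training_mode == "FORWARD_ONLY" and lora_rank > 0:
        return (
            "FORWARD_ONLY shape combined with lora_rank > 0; consider the "
            "shared-session reference (policy.create_base_reference())."
        )
    if is_custom_base_model:
        return (
            "Custom base models need a training shape registered and "
            "validated for them before use."
        )
    return "Omit the explicit shape ID so the platform can auto-select a validated shape."


# Illustrative call with a made-up error string.
hint = shape_error_hint(
    "HTTP 400: no validated training shape exists",
    training_mode="FORWARD_ONLY",
    lora_rank=8,
)
print(hint)
```

Returning None for unrelated errors keeps the caller simple: it can append the hint to the RuntimeError only when one is produced.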
Slack Thread