
fix(training_shapes): handle 403/non-success model fetch gracefully in auto-select#337

Draft
websterbei wants to merge 1 commit into main from cursor/fix-training-shape-403-error-b042


@websterbei
Contributor

Problem

Training job rft--pyroworks-dev-ftsy30v1-e19defec7 crashed with:

RuntimeError: Failed to fetch model details for 'accounts/fireworks/models/qwen3p5-397b-a17b' (HTTP 403)

This happens in auto_select_training_shape when:

  1. No exact base_model match is found in validated training shapes
  2. The fallback path calls GET /v1/accounts/fireworks/models/qwen3p5-397b-a17b to get model type and parameter count for a broader search
  3. That API call returns HTTP 403 (the model may be restricted or not yet public)

The RuntimeError is unrecoverable and provides no actionable guidance.

Fix

Catch the RuntimeError from _fetch_model_context, log a warning, and skip the parameter-count fallback. The function then falls through to raise a ValueError with the message "Provide an explicit training_shape_id", which tells the user exactly what to do.

This is the right behavior because:

  • If the exact match found nothing and the model details are inaccessible, auto-select has nothing left to try
  • A ValueError with an actionable message is more useful than a RuntimeError about an HTTP status code
  • The training job gets a clear error up front instead of a cryptic crash
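The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in training/utils/training_shapes.py: the function signatures, the dict-based shape records, the stubbed `_fetch_model_context` that always simulates a 403, and the simplified broader search are all assumptions made for the example.

```python
import logging

logger = logging.getLogger(__name__)


def _fetch_model_context(base_model):
    """Stand-in for the real helper; here it always simulates an HTTP 403."""
    raise RuntimeError(
        f"Failed to fetch model details for '{base_model}' (HTTP 403)"
    )


def auto_select_training_shape(base_model, validated_shapes):
    """Return a shape ID for base_model, or raise an actionable ValueError.

    validated_shapes is assumed to be a list of dicts with "base_model"
    and "id" keys (hypothetical layout for this sketch).
    """
    # Step 1: an exact base_model match needs no model fetch at all.
    for shape in validated_shapes:
        if shape["base_model"] == base_model:
            return shape["id"]

    # Step 2: the broader search needs model details. A failed fetch is
    # logged and the fallback is skipped instead of crashing the job.
    try:
        context = _fetch_model_context(base_model)
    except RuntimeError as exc:
        logger.warning("Skipping parameter-count fallback: %s", exc)
        context = None

    if context is not None:
        # Hypothetical, simplified broader search by model type.
        for shape in validated_shapes:
            if shape.get("model_type") == context.get("model_type"):
                return shape["id"]

    # Step 3: nothing matched, so tell the caller what to do next.
    raise ValueError(
        f"No validated training shape found for '{base_model}'. "
        "Provide an explicit training_shape_id."
    )
```

With the stub above, calling the function with no matching shapes raises the ValueError, while an exact match returns before the fetch is ever attempted.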

Changes

  • training/utils/training_shapes.py: Wrap _fetch_model_context call in try/except to handle 403 and other HTTP errors gracefully
  • training/tests/unit/test_training_shapes.py: New test file with 3 tests:
    • 403 response produces actionable ValueError (not RuntimeError)
    • 404 response produces actionable ValueError
    • Exact match path skips model fetch entirely (403 doesn't matter)
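The three test cases listed above might look something like the sketch below. It is self-contained for illustration only: `select_shape` is a hypothetical stand-in for auto_select_training_shape that takes the fetch helper as a parameter, whereas the real tests presumably patch _fetch_model_context in the module under test. The shape dict layout is also an assumption.

```python
import unittest
from unittest import mock


def select_shape(base_model, shapes, fetch_model_context):
    """Hypothetical stand-in for auto_select_training_shape."""
    for shape in shapes:
        if shape["base_model"] == base_model:
            return shape["id"]  # exact match: no fetch needed
    try:
        fetch_model_context(base_model)
    except RuntimeError:
        pass  # the real code logs a warning here
    # The broader parameter-count search is omitted from this stand-in.
    raise ValueError("Provide an explicit training_shape_id")


def failing_fetch(status):
    """Build a mock fetcher that fails the way a non-success response does."""
    error = RuntimeError(f"Failed to fetch model details (HTTP {status})")
    return mock.Mock(side_effect=error)


class AutoSelectTests(unittest.TestCase):
    def test_http_errors_become_value_error(self):
        # Both 403 and 404 should surface as the actionable ValueError.
        for status in (403, 404):
            with self.subTest(status=status):
                with self.assertRaisesRegex(ValueError, "training_shape_id"):
                    select_shape("some-model", [], failing_fetch(status))

    def test_exact_match_skips_fetch(self):
        fetch = failing_fetch(403)
        shapes = [{"base_model": "some-model", "id": "shape-1"}]
        self.assertEqual(select_shape("some-model", shapes, fetch), "shape-1")
        fetch.assert_not_called()
```

Injecting the fetcher keeps the sketch free of module-level patching; in a real test file, `mock.patch` on the helper would achieve the same isolation.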

Testing

All 336 unit tests pass (32 skipped), including the 3 new ones.


…n auto-select

When auto_select_training_shape fails to find an exact base_model match,
it falls back to fetching model details (model_type, parameter_count) for
a broader search. If that GET returns 403 (or any HTTP error), the code
previously raised a RuntimeError, crashing the job.

Now the RuntimeError from _fetch_model_context is caught, a warning is
logged, and the function falls through to raise a ValueError with the
actionable message 'Provide an explicit training_shape_id' instead.

Observed on job rft--pyroworks-dev-ftsy30v1-e19defec7 where
accounts/fireworks/models/qwen3p5-397b-a17b returned HTTP 403.

Co-authored-by: Webster Bei Yijie <beiyijie@fireworks.ai>