fix(training_shapes): handle 403/non-success model fetch gracefully in auto-select #337
Draft
websterbei wants to merge 1 commit into main
…n auto-select

When auto_select_training_shape fails to find an exact base_model match, it falls back to fetching model details (model_type, parameter_count) for a broader search. If that GET returns 403 (or any HTTP error), the code previously raised a RuntimeError, crashing the job.

Now the RuntimeError from _fetch_model_context is caught, a warning is logged, and the function falls through to raise a ValueError with the actionable message 'Provide an explicit training_shape_id' instead.

Observed on job rft--pyroworks-dev-ftsy30v1-e19defec7 where accounts/fireworks/models/qwen3p5-397b-a17b returned HTTP 403.

Co-authored-by: Webster Bei Yijie <beiyijie@fireworks.ai>
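The control flow described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the code in training/utils/training_shapes.py: the function name select_shape, its parameters, and the injected fetch_model_context and find_by_context callables are all hypothetical stand-ins chosen so the example runs on its own.

```python
import logging
from typing import Callable, Mapping, Optional

logger = logging.getLogger(__name__)


def select_shape(
    base_model: str,
    validated_shapes: Mapping[str, str],
    fetch_model_context: Callable[[str], dict],
    find_by_context: Callable[[dict], Optional[str]],
) -> str:
    """Hypothetical stand-in for auto_select_training_shape."""
    # Step 1: an exact base_model match wins immediately.
    if base_model in validated_shapes:
        return validated_shapes[base_model]

    # Step 2: the broader search needs model details, and the fetch
    # may fail (for example with HTTP 403). Treat that as "no context"
    # instead of letting the RuntimeError crash the job.
    try:
        context = fetch_model_context(base_model)
    except RuntimeError as exc:
        logger.warning(
            "Could not fetch model details for %s (%s); "
            "skipping parameter-count fallback",
            base_model,
            exc,
        )
        context = None

    if context is not None:
        shape = find_by_context(context)
        if shape is not None:
            return shape

    # Step 3: nothing matched, so fail with an actionable message.
    raise ValueError(
        f"No validated training shape found for {base_model!r}. "
        "Provide an explicit training_shape_id."
    )
```

Passing the fetcher in as an argument is a choice made only for this sketch; it keeps the example free of network calls and makes the failure path easy to exercise.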
Problem

Training job rft--pyroworks-dev-ftsy30v1-e19defec7 crashed with a RuntimeError. This happens in auto_select_training_shape when:

- No exact base_model match is found in validated training shapes
- The code falls back to GET /v1/accounts/fireworks/models/qwen3p5-397b-a17b to get model type and parameter count for a broader search
- That request returns HTTP 403 (or another HTTP error)

The RuntimeError is unrecoverable and provides no actionable guidance.

Fix

Catch the RuntimeError from _fetch_model_context, log a warning, and skip the parameter-count fallback. The function then falls through to raise a ValueError with the message "Provide an explicit training_shape_id", which tells the user exactly what to do.

This is the right behavior because a ValueError with an actionable message is more useful than a RuntimeError about an HTTP status code.

Changes

- training/utils/training_shapes.py: Wrap the _fetch_model_context call in try/except to handle 403 and other HTTP errors gracefully
- training/tests/unit/test_training_shapes.py: New test file with 3 tests, covering that a failed model fetch surfaces as a ValueError (not a RuntimeError)

Testing
All 336 unit tests pass (32 skipped), including the 3 new ones.
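A test for the failure path might look something like the sketch below. It does not reproduce the tests in training/tests/unit/test_training_shapes.py: auto_select here is a deliberately tiny, hypothetical stand-in for the real function, and the logger name and the "shape-for-..." return value are invented for illustration.

```python
import logging
import unittest
from unittest import mock

logger = logging.getLogger("training_shapes_sketch")


def auto_select(base_model, fetch_model_context):
    """Hypothetical stand-in that only models the fallback path."""
    try:
        context = fetch_model_context(base_model)
    except RuntimeError as exc:
        logger.warning("model fetch failed for %s: %s", base_model, exc)
        context = None
    if context is not None:
        return f"shape-for-{context['model_type']}"
    raise ValueError("Provide an explicit training_shape_id")


class AutoSelectFetchFailureTest(unittest.TestCase):
    def test_fetch_failure_raises_value_error(self):
        # Simulate the model fetch failing with an HTTP 403.
        fetch = mock.Mock(side_effect=RuntimeError("HTTP 403"))
        with self.assertLogs(logger, level="WARNING"):
            with self.assertRaises(ValueError) as ctx:
                auto_select("accounts/fireworks/models/qwen3p5-397b-a17b", fetch)
        self.assertIn("training_shape_id", str(ctx.exception))
        fetch.assert_called_once()

    def test_successful_fetch_still_selects(self):
        # The happy path should be unaffected by the new error handling.
        fetch = mock.Mock(return_value={"model_type": "qwen"})
        self.assertEqual(auto_select("some-model", fetch), "shape-for-qwen")
```

The first test checks both halves of the change: a warning is logged, and the caller sees a ValueError rather than the underlying RuntimeError.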
Slack Thread