fix(training_shapes): add fallback passes and diagnostic logging for shape selection #338
Draft
When auto_select_training_shape fails to find shapes via the combined
server-side filter (base_model + trainer_mode + validated), the function
now retries with progressively relaxed strategies before giving up:
Pass 1 - exact base_model + trainer_mode (original behavior, unchanged)
Pass 1b - base_model only server-side, trainer_mode checked client-side
(handles APIs that don't support string enum filtering)
Pass 1c - account-scoped parent derived from base_model name
(handles wildcard parent not listing cross-account shapes)
Pass 2 - model_type + parameter_count fallback (existing)
Pass 2b - relaxed param filter without server-side trainer_mode
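The pass ordering above can be sketched as a "first non-empty result wins" loop. This is a minimal illustration, not the code in this PR: `first_nonempty_pass` and the lambda stand-ins are hypothetical names, and each real pass would issue a list call with its own filter expression.

```python
import logging
from typing import Callable, List, Optional, Sequence, Tuple

logger = logging.getLogger("training_shapes_sketch")

# A pass is a (label, callable) pair; the callable returns candidate shape names.
Pass = Tuple[str, Callable[[], List[str]]]


def first_nonempty_pass(passes: Sequence[Pass]) -> Optional[Tuple[str, List[str]]]:
    """Run passes in order and return the first one that yields candidates."""
    for label, run in passes:
        candidates = run()
        # Log the count at every pass so an empty result is never silent.
        logger.info("pass %s returned %d candidate(s)", label, len(candidates))
        if candidates:
            return label, candidates
    logger.warning("no pass produced candidates")
    return None


# Hypothetical stand-ins for the real passes (strictest first).
passes = [
    ("1", lambda: []),            # exact base_model + trainer_mode
    ("1b", lambda: ["shape-a"]),  # base_model only, mode checked client-side
    ("1c", lambda: ["shape-b"]),  # account-scoped parent
]
result = first_nonempty_pass(passes)
```

Later passes are only attempted when stricter ones return nothing, so the original behavior is preserved whenever Pass 1 succeeds.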
Also:
- Gracefully handles 403/non-success from model details API (ValueError
fallback instead of RuntimeError crash)
- Adds comprehensive INFO/WARNING logging at each pass with filter
expressions and candidate counts for debugging
- Logs _list_and_filter skip reasons (mode mismatch vs ctx too short)
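A client-side filter that records why each shape was skipped might look like the sketch below. The field names (`trainer_mode`, `max_context_length`), the function names, and the numeric-to-name mapping are all assumptions made for illustration; the real proto enum values are not stated in this PR.

```python
import logging
from typing import Dict, List, Tuple, Union

logger = logging.getLogger("training_shapes_sketch")

# Hypothetical mapping: which number corresponds to which mode is an assumption.
ASSUMED_MODE_NAMES: Dict[int, str] = {1: "POLICY_TRAINER"}


def normalize_mode(value: Union[int, str]) -> str:
    """Map a numeric enum value to its assumed string name; pass strings through."""
    if isinstance(value, int):
        return ASSUMED_MODE_NAMES.get(value, str(value))
    return value


def filter_shapes(
    shapes: List[dict], wanted_mode: str, min_context: int
) -> Tuple[List[dict], Dict[str, int]]:
    """Keep shapes matching the mode and context length; count skip reasons."""
    kept: List[dict] = []
    skipped = {"mode_mismatch": 0, "ctx_too_short": 0}
    for shape in shapes:
        if normalize_mode(shape["trainer_mode"]) != wanted_mode:
            skipped["mode_mismatch"] += 1
            continue
        if shape["max_context_length"] < min_context:
            skipped["ctx_too_short"] += 1
            continue
        kept.append(shape)
    if shapes and not kept:
        # The API returned results, but every one was rejected client-side.
        logger.warning("all %d shape(s) filtered out: %s", len(shapes), skipped)
    return kept, skipped
```

Comparing normalized values client-side sidesteps the question of whether the server stores the mode as a number or a string.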
Observed on job rft--pyroworks-dev-ftsy30v1-e19defec7 where model
qwen3p5-397b-a17b had validated shapes visible via firectl but
auto_select returned nothing before hitting a 403 on the model API.
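The 403 handling described above could take roughly the following form: a non-success response becomes a ValueError, and the caller logs it and skips the parameter-count fallback. The function names and the injected `get` callable are hypothetical and exist only to keep the sketch self-contained; the real code calls the model details API.

```python
import logging
from typing import Callable, Dict, Optional, Tuple

logger = logging.getLogger("training_shapes_sketch")

# Hypothetical transport: takes a model name, returns (HTTP status, JSON body).
Getter = Callable[[str], Tuple[int, Dict]]


def fetch_model_context(get: Getter, model: str) -> Dict:
    """Return model details, raising ValueError on any non-success status."""
    status, body = get(model)
    if status != 200:
        raise ValueError(f"model details unavailable for {model}: HTTP {status}")
    return body


def model_context_or_none(get: Getter, model: str) -> Optional[Dict]:
    """Return model details, or None so the caller can skip the fallback pass."""
    try:
        return fetch_model_context(get, model)
    except ValueError as exc:
        logger.warning("skipping param-count fallback: %s", exc)
        return None
```

Returning None instead of propagating the error lets shape selection report "no shapes found" with full diagnostics rather than crashing on the permissions failure.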
Co-authored-by: Yufei (Benny) Chen <benjibc@users.noreply.github.com>
Problem
Training job rft--pyroworks-dev-ftsy30v1-e19defec7 for model qwen3p5-397b-a17b crashed because auto_select_training_shape couldn't find any matching shapes, even though validated shapes exist and are visible via firectl.
The function's Pass 1 (exact base_model + trainer_mode server-side filter) returned zero candidates silently, then Pass 2 failed because the model API returned HTTP 403 for accounts/fireworks/models/qwen3p5-397b-a17b.
There was no logging to help diagnose why the combined server-side filter returned nothing: the shapes exist, are validated and public, yet the API listing returned empty.
Root Cause Analysis
The most likely causes for Pass 1 returning empty despite shapes existing:
- trainer_mode filter mismatch: The API may store trainer_mode as a numeric proto enum (1, 2, 3) while we filter with the string name ("POLICY_TRAINER"). The combined server filter silently returns nothing.
- accounts/-/trainingShapes/- not listing cross-account shapes: On the dev gateway, the wildcard parent may not enumerate shapes from the fireworks account.
Fix
Added progressive fallback passes with diagnostic logging:
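- Pass 1: base_model + trainer_mode server-side filter
- Pass 1b: base_model only server-side, trainer_mode checked client-side
- Pass 1c: account-scoped parent derived from base_model name
- Pass 2: model_type + parameter_count bucket fallback
- Pass 2b: relaxed param filter (no trainer_mode server-side)
Also: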
- _fetch_model_context failures are caught and logged, skipping the param-count fallback instead of crashing
- _list_and_filter logs skip reasons (mode mismatch vs context too short) when the API returns results but all are filtered client-side
- _account_from_base_model helper: Extracts the account from accounts/{account}/models/{model} for the account-scoped retry
Testing
- Tests in test_training_shapes.py covering all passes, helpers, and edge cases
- (… math_verify/eval_protocol deps)
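For reference, the account extraction used by the account-scoped retry can be sketched as below. This is an illustration assuming the accounts/{account}/models/{model} format described above; the regex and the None return for unparseable names are assumptions, and the actual _account_from_base_model helper may differ.

```python
import re
from typing import Optional

# Matches exactly accounts/{account}/models/{model}, with no extra segments.
_BASE_MODEL_RE = re.compile(r"^accounts/([^/]+)/models/([^/]+)$")


def account_from_base_model(base_model: str) -> Optional[str]:
    """Return the account segment, or None if the name is not fully qualified."""
    match = _BASE_MODEL_RE.match(base_model)
    return match.group(1) if match else None


print(account_from_base_model("accounts/fireworks/models/qwen3p5-397b-a17b"))
```

Returning None for a bare model name lets the caller skip the account-scoped pass instead of building a malformed parent.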