
fix(training_shapes): add fallback passes and diagnostic logging for shape selection #338

Draft
benjibc wants to merge 1 commit into main from cursor/fix-shape-selection-logging-9bef


@benjibc benjibc commented Apr 15, 2026

Problem

Training job rft--pyroworks-dev-ftsy30v1-e19defec7 for model qwen3p5-397b-a17b crashed because auto_select_training_shape couldn't find any matching shapes, even though validated shapes exist and are visible via firectl:

accounts/fireworks/trainingShapes/qwen3p5-397b-a17b-32k-b200-lora
accounts/fireworks/trainingShapes/qwen3p5-397b-a17b-262k-b200

The function's Pass 1 (exact base_model + trainer_mode server-side filter) returned zero candidates silently, then Pass 2 failed because the model API returned HTTP 403 for accounts/fireworks/models/qwen3p5-397b-a17b.

There was no logging to help diagnose why the combined server-side filter returned nothing — the shapes exist, are validated and public, yet the API listing returned empty.

Root Cause Analysis

The most likely causes for Pass 1 returning empty despite shapes existing:

  1. Server-side trainer_mode filter mismatch: The API may store trainer_mode as a numeric proto enum (1, 2, 3) while we filter with the string name ("POLICY_TRAINER"). The combined server filter silently returns nothing.
  2. Wildcard parent accounts/-/trainingShapes/- not listing cross-account shapes: On the dev gateway, the wildcard parent may not enumerate shapes from the fireworks account.
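
To illustrate the first hypothesis, here is a minimal sketch of how an equality filter can return nothing when the stored value is numeric and the filter value is the enum name. The `server_filter` function, the dictionary field names, and the mapping of `POLICY_TRAINER` to `1` are assumptions made for illustration; the actual storage format and filter semantics of the API have not been confirmed.

```python
from typing import Any, Dict, List

BASE_MODEL = "accounts/fireworks/models/qwen3p5-397b-a17b"

# Hypothetical stored record: trainer_mode kept as a numeric enum value.
# The value 1 standing for POLICY_TRAINER is an assumption, not a confirmed mapping.
stored_shapes: List[Dict[str, Any]] = [
    {
        "name": "accounts/fireworks/trainingShapes/qwen3p5-397b-a17b-262k-b200",
        "base_model": BASE_MODEL,
        "trainer_mode": 1,
    }
]


def server_filter(shapes: List[Dict[str, Any]], base_model: str, trainer_mode: Any) -> List[Dict[str, Any]]:
    """Stand-in for a server-side filter that compares values by plain equality."""
    return [
        s for s in shapes
        if s["base_model"] == base_model and s["trainer_mode"] == trainer_mode
    ]


# Filtering by the enum name matches nothing, and no error is raised.
by_name = server_filter(stored_shapes, BASE_MODEL, "POLICY_TRAINER")

# Filtering by the numeric value matches the stored record.
by_number = server_filter(stored_shapes, BASE_MODEL, 1)
```

If this is what happens, dropping `trainer_mode` from the server-side filter and checking it client-side (Pass 1b) would sidestep the mismatch.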

Fix

Added progressive fallback passes with diagnostic logging:

| Pass | Strategy | What it catches |
|------|----------|-----------------|
| 1 | Exact base_model + trainer_mode server-side filter | Original behavior (unchanged) |
| 1b | base_model only server-side, trainer_mode checked client-side | API doesn't support string enum filtering |
| 1c | Account-scoped parent derived from base_model name | Wildcard parent doesn't list cross-account shapes |
| 2 | model_type + parameter_count bucket fallback | Different base_model name, same architecture |
| 2b | Relaxed param filter (no trainer_mode server-side) | Same as 1b but for the param-count path |
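
The control flow of the passes above can be sketched roughly as follows. This is not the code in this PR: `select_with_fallbacks`, the `(label, callable)` pass representation, and the logger name are hypothetical, and each callable stands in for one list-and-filter call against the API.

```python
import logging
from typing import Any, Callable, Dict, List, Optional, Tuple

logger = logging.getLogger("training_shapes_sketch")  # hypothetical logger name

Shape = Dict[str, Any]
Pass = Tuple[str, Callable[[], List[Shape]]]


def select_with_fallbacks(passes: List[Pass]) -> Optional[Shape]:
    """Run each pass in order and return the first candidate found, if any."""
    for label, run in passes:
        try:
            candidates = run()
        except ValueError as exc:
            # A pass that cannot run (for example, model details unavailable)
            # is logged and skipped rather than aborting the whole selection.
            logger.warning("pass %s skipped: %s", label, exc)
            continue
        logger.info("pass %s returned %d candidate(s)", label, len(candidates))
        if candidates:
            return candidates[0]
    return None


# Usage: Pass 1 finds nothing, Pass 1b finds one shape, later passes never run.
example_shape = {"name": "accounts/fireworks/trainingShapes/qwen3p5-397b-a17b-262k-b200"}
chosen = select_with_fallbacks([
    ("1", lambda: []),
    ("1b", lambda: [example_shape]),
    ("1c", lambda: []),
])
```

Ordering the passes from strictest to most relaxed keeps the original behavior whenever Pass 1 succeeds; the relaxed passes only run when it returns nothing.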

Also:

  • Graceful 403 handling: _fetch_model_context failures are caught and logged, skipping the param-count fallback instead of crashing
  • Comprehensive logging: Every pass logs its filter expression and candidate count at INFO level; _list_and_filter logs skip reasons (mode mismatch vs context too short) when API returns results but all are filtered client-side
  • _account_from_base_model helper: Extracts the account from accounts/{account}/models/{model} for the account-scoped retry
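
As a rough sketch of two of the pieces listed above, the snippet below extracts the account from a model resource name and counts client-side skip reasons. The function names, the `trainer_mode` and `max_context` field names, and the log message are assumptions for illustration; the real `_account_from_base_model` and `_list_and_filter` may differ.

```python
import logging
from typing import Any, Dict, List, Optional, Tuple

logger = logging.getLogger("training_shapes_sketch")  # hypothetical logger name


def account_from_base_model(base_model: str) -> Optional[str]:
    """Return {account} from accounts/{account}/models/{model}, else None."""
    parts = base_model.split("/")
    if (
        len(parts) == 4
        and parts[0] == "accounts"
        and parts[2] == "models"
        and parts[1]
        and parts[3]
    ):
        return parts[1]
    return None


def filter_with_reasons(
    shapes: List[Dict[str, Any]], trainer_mode: Any, min_context: int
) -> Tuple[List[Dict[str, Any]], Dict[str, int]]:
    """Filter shapes client-side and count why each rejected shape was skipped."""
    kept: List[Dict[str, Any]] = []
    skipped = {"mode_mismatch": 0, "context_too_short": 0}
    for shape in shapes:
        if shape["trainer_mode"] != trainer_mode:
            skipped["mode_mismatch"] += 1
        elif shape["max_context"] < min_context:
            skipped["context_too_short"] += 1
        else:
            kept.append(shape)
    if shapes and not kept:
        # The API returned results, but every one was filtered out client-side.
        logger.info("all %d shape(s) skipped: %s", len(shapes), skipped)
    return kept, skipped


account = account_from_base_model("accounts/fireworks/models/qwen3p5-397b-a17b")
kept, skipped = filter_with_reasons(
    [
        {"trainer_mode": "POLICY_TRAINER", "max_context": 32_000},
        {"trainer_mode": "OTHER_MODE", "max_context": 262_000},
    ],
    trainer_mode="POLICY_TRAINER",
    min_context=100_000,
)
```

Returning `None` for names that don't match the expected pattern lets the caller skip the account-scoped retry instead of building a malformed parent.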

Testing

  • Added 23 new tests in test_training_shapes.py covering all passes, helpers, and edge cases
  • All 331 previously passing unit tests still pass, for 354 passing in total (331 existing + 23 new); the run also has 32 pre-existing skips and 7 pre-existing failures caused by missing math_verify/eval_protocol deps

…shape selection

When auto_select_training_shape fails to find shapes via the combined
server-side filter (base_model + trainer_mode + validated), the function
now retries with progressively relaxed strategies before giving up:

Pass 1  - exact base_model + trainer_mode (original behavior, unchanged)
Pass 1b - base_model only server-side, trainer_mode checked client-side
          (handles APIs that don't support string enum filtering)
Pass 1c - account-scoped parent derived from base_model name
          (handles wildcard parent not listing cross-account shapes)
Pass 2  - model_type + parameter_count fallback (existing)
Pass 2b - relaxed param filter without server-side trainer_mode

Also:
- Gracefully handles 403/non-success from model details API (ValueError
  fallback instead of RuntimeError crash)
- Adds comprehensive INFO/WARNING logging at each pass with filter
  expressions and candidate counts for debugging
- Logs _list_and_filter skip reasons (mode mismatch vs ctx too short)

Observed on job rft--pyroworks-dev-ftsy30v1-e19defec7 where model
qwen3p5-397b-a17b had validated shapes visible via firectl but
auto_select returned nothing before hitting a 403 on the model API.

Co-authored-by: Yufei (Benny) Chen <benjibc@users.noreply.github.com>