fix(training_shapes): add fallback passes and diagnostic logging for shape selection by benjibc · Pull Request #338 · fw-ai/cookbook

benjibc · 2026-04-15T06:55:59Z

Problem

Training job rft--pyroworks-dev-ftsy30v1-e19defec7 for model qwen3p5-397b-a17b crashed because auto_select_training_shape couldn't find any matching shapes, even though validated shapes exist and are visible via firectl:

accounts/fireworks/trainingShapes/qwen3p5-397b-a17b-32k-b200-lora
accounts/fireworks/trainingShapes/qwen3p5-397b-a17b-262k-b200

The function's Pass 1 (exact base_model + trainer_mode server-side filter) returned zero candidates silently, then Pass 2 failed because the model API returned HTTP 403 for accounts/fireworks/models/qwen3p5-397b-a17b.

There was no logging to help diagnose why the combined server-side filter returned nothing: the shapes exist and are validated and public, yet the API listing came back empty.

Root Cause Analysis

The most likely causes for Pass 1 returning empty despite shapes existing:

Server-side trainer_mode filter mismatch: The API may store trainer_mode as a numeric proto enum (1, 2, 3) while we filter with the string name ("POLICY_TRAINER"). The combined server filter silently returns nothing.
Wildcard parent accounts/-/trainingShapes/- not listing cross-account shapes: On the dev gateway, the wildcard parent may not enumerate shapes from the fireworks account.
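
To make the first hypothesis concrete, here is a minimal sketch of a client-side comparison that accepts either representation of trainer_mode. It is not the code in this PR. The function name trainer_mode_matches and the wanted_number parameter are hypothetical, and the numeric value used in the usage line is a placeholder, since the real proto enum numbering is not stated here.

```python
from typing import Optional, Union


def trainer_mode_matches(
    stored: Union[int, str],
    wanted_name: str,
    wanted_number: Optional[int] = None,
) -> bool:
    """Return True if a stored trainer_mode matches the wanted mode.

    `stored` may be the enum name (e.g. "POLICY_TRAINER") or its numeric
    proto value, either as an int or as a digit string. `wanted_number`
    is the numeric value of the wanted mode, if the caller knows it
    (hypothetical parameter; the real numbering is an assumption).
    """
    if isinstance(stored, int):
        return wanted_number is not None and stored == wanted_number
    text = str(stored).strip()
    if text.isdigit():
        return wanted_number is not None and int(text) == wanted_number
    return text.upper() == wanted_name.upper()


# Placeholder number: 1 is NOT confirmed to be POLICY_TRAINER's enum value.
print(trainer_mode_matches(1, "POLICY_TRAINER", wanted_number=1))
```

A server-side filter that only compares strings would miss the numeric form, which is the situation a client-side check like this is meant to tolerate.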

Fix

Added progressive fallback passes with diagnostic logging:

Pass	Strategy	What it catches
1	Exact `base_model` + `trainer_mode` server-side filter	Original behavior (unchanged)
1b	`base_model` only server-side, `trainer_mode` checked client-side	API doesn't support string enum filtering
1c	Account-scoped parent derived from `base_model` name	Wildcard parent doesn't list cross-account shapes
2	`model_type` + `parameter_count` bucket fallback	Different base_model name, same architecture
2b	Relaxed param filter (no `trainer_mode` server-side)	Same as 1b but for the param-count path
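
The first three rows of the table can be sketched roughly as below. This is an illustration under assumptions, not the PR's implementation: list_shapes is a hypothetical callable taking (parent, filter_expr), the filter expression syntax and the "accounts/-" wildcard parent string are guesses, and checking trainer_mode client-side in Pass 1c is also an assumption.

```python
import logging
from typing import Callable, Dict, List, Optional, Tuple

logger = logging.getLogger("training_shapes_sketch")

Shape = Dict[str, object]
ListFn = Callable[[str, str], List[Shape]]  # hypothetical: (parent, filter_expr) -> shapes


def select_with_fallbacks(
    list_shapes: ListFn,
    base_model: str,
    trainer_mode: str,
    scoped_parent: Optional[str] = None,
    wildcard_parent: str = "accounts/-",  # assumed wildcard form
) -> Tuple[Optional[str], List[Shape]]:
    """Try progressively relaxed listings; return (pass_name, candidates)."""
    # Illustrative filter syntax only; the real API grammar may differ.
    both = f'base_model="{base_model}" AND trainer_mode="{trainer_mode}"'
    model_only = f'base_model="{base_model}"'

    # (pass name, parent, server-side filter, check trainer_mode client-side?)
    passes = [
        ("1", wildcard_parent, both, False),
        ("1b", wildcard_parent, model_only, True),
    ]
    if scoped_parent:
        passes.append(("1c", scoped_parent, model_only, True))

    for name, parent, filter_expr, check_mode in passes:
        candidates = list_shapes(parent, filter_expr)
        if check_mode:
            candidates = [
                s for s in candidates if str(s.get("trainer_mode")) == trainer_mode
            ]
        logger.info(
            "pass %s: parent=%s filter=%r -> %d candidate(s)",
            name, parent, filter_expr, len(candidates),
        )
        if candidates:
            return name, candidates
    return None, []
```

Keeping the passes in a list keeps the ordering explicit and gives every pass the same log line, so an empty result can be traced to the exact filter that produced it.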

Also:

Graceful 403 handling: _fetch_model_context failures are caught and logged, skipping the param-count fallback instead of crashing
Comprehensive logging: Every pass logs its filter expression and candidate count at INFO level; _list_and_filter logs skip reasons (mode mismatch vs context too short) when API returns results but all are filtered client-side
_account_from_base_model helper: Extracts the account from accounts/{account}/models/{model} for the account-scoped retry
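
As a rough illustration of the helper and the 403 handling described above, the sketch below shows one way they could look. The names account_from_base_model and fetch_context_or_none are hypothetical stand-ins rather than the PR's functions, and the fetch callable and the exception types it raises are assumptions.

```python
import logging
import re
from typing import Callable, Optional

logger = logging.getLogger("training_shapes_sketch")

# Matches resource names of the form accounts/{account}/models/{model}.
_MODEL_NAME_RE = re.compile(r"^accounts/([^/]+)/models/([^/]+)$")


def account_from_base_model(base_model: str) -> Optional[str]:
    """Return the account segment of a model resource name, or None."""
    match = _MODEL_NAME_RE.match(base_model)
    return match.group(1) if match else None


def fetch_context_or_none(
    fetch: Callable[[str], dict], base_model: str
) -> Optional[dict]:
    """Call `fetch` and return None on failure instead of crashing.

    `fetch` is a hypothetical stand-in for the model-details lookup; it is
    assumed to raise RuntimeError or ValueError on a non-success response
    such as HTTP 403.
    """
    try:
        return fetch(base_model)
    except (RuntimeError, ValueError) as exc:
        logger.warning(
            "model details unavailable for %s (%s); skipping param-count fallback",
            base_model, exc,
        )
        return None
```

For example, account_from_base_model("accounts/fireworks/models/qwen3p5-397b-a17b") yields "fireworks", which could then be used to build the account-scoped parent for the retry.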

Testing

Added 23 new tests in test_training_shapes.py covering all passes, helpers, and edge cases
All 331 existing unit tests continue to pass, for 354 passing in total (331 existing + 23 new; 32 pre-existing skips, 7 pre-existing failures from missing math_verify/eval_protocol deps)

…shape selection

When auto_select_training_shape fails to find shapes via the combined server-side filter (base_model + trainer_mode + validated), the function now retries with progressively relaxed strategies before giving up:

Pass 1 - exact base_model + trainer_mode (original behavior, unchanged)
Pass 1b - base_model only server-side, trainer_mode checked client-side (handles APIs that don't support string enum filtering)
Pass 1c - account-scoped parent derived from base_model name (handles wildcard parent not listing cross-account shapes)
Pass 2 - model_type + parameter_count fallback (existing)
Pass 2b - relaxed param filter without server-side trainer_mode

Also:
- Gracefully handles 403/non-success from model details API (ValueError fallback instead of RuntimeError crash)
- Adds comprehensive INFO/WARNING logging at each pass with filter expressions and candidate counts for debugging
- Logs _list_and_filter skip reasons (mode mismatch vs ctx too short)

Observed on job rft--pyroworks-dev-ftsy30v1-e19defec7 where model qwen3p5-397b-a17b had validated shapes visible via firectl but auto_select returned nothing before hitting a 403 on the model API.

Co-authored-by: Yufei (Benny) Chen <benjibc@users.noreply.github.com>

benjibc wants to merge 1 commit into main from cursor/fix-shape-selection-logging-9bef
