Derivative of #25. That issue establishes the type split (NUMBER vs PRESCRIPTION) and the near-term safety slice (decision-aware gate + no in-place rewrite of prescriptions). This issue is the research-gated epic: how we actually verify a prescription once we stop rewriting it.
The framing decided in #25 (discussion): falsify, don't predict. The science says a single point-prediction simulator is not achievable — the fitness-fatigue model is statistically ill-conditioned with poor parameter identifiability and overfitting fatigue terms (Sci Rep 2025), and Banister himself recommended re-fitting every 60–90 days. So the verifier's job is refutation against a typed feasibility envelope, with stated residual uncertainty — never a promise of outcome. This keeps the "model never self-certifies" invariant and breaks the recursion (a sound critic need not be as capable as the planner — LLM-Modulo, Kambhampati et al., ICML 2024).
Goals
-
Feasibility-model registry, keyed by (sport, goal_type), each model implementing one interface:
check(prescription, canonical_state, goal) -> {feasible | infeasible | uncertain,
binding_critique, citation}
- Declared default per domain + a rule-based floor when no fitted model applies.
- Unknown / unsupported
(sport, goal_type) → abstain/degrade honestly ("can't verify a plan for this yet"), never flatten to 7×CTL maintenance.
GoalType is a closed enum (EVENT, TARGET_METRIC, DISTANCE, PROCESS, OTHER, domain/enums.py:339); sport is the data-driven registry (persistence/models/athlete.py). The registry key is (sport, goal_type).
-
Banded endurance verifier (first concrete model). Reuse the existing deterministic, seedable AnalyticsService.pmc(...) (analytics/pmc.py, service.py:315) as the forward integrator. Simulate CTL/ATL/TSB across a plausible parameter band, not a point estimate; CONTRADICTED only when infeasible across the whole band; in-band ambiguity → DEGRADE with uncertainty stated.
-
Safety envelope as review flags, not injury predictions — and explicitly NOT ACWR. ACWR is discredited as a causal/predictive injury metric (Impellizzeri et al. 2020, PMID 32502973; 2024 review); the persisted daily_wellness.acwr column (persistence/models/wellness.py:126, read by nothing) stays orphaned. Use instead:
- bounded weekly ramp rate (coaching guardrail);
- Foster monotony & strain (Foster 1998, MSSE, PMID 9662690);
- internal consistency (weekly total = Σ prescribed sessions — subsumes 7×CTL as one feasible point);
- timeline coherence (room for a taper before
target_date).
-
Intensity anchoring off canonical thresholds — we already have CRITICAL_POWER_W, W_PRIME_J, FitnessSignature.ftp_w. Verify per-session targets as deterministic functions of CP/FTP (zone derivations), extending the closed derivation table the 7×CTL precedent legitimized.
-
Strength / hypertrophy is a different verifier, not the same simulator. It is not an impulse-response phenomenon — model it as volume landmarks (MEV→MAV→MRV) + progressive overload per the ACSM progression-model position stand 2009 and the RP volume-landmark framework. This is the clearest reason the registry must support heterogeneous models behind one interface rather than one universal simulator.
-
Close the generate-test loop. Per-claim grounding critiques are computed in ground and discarded before compose (graph_state.py render_context), so REGENERATE is a blind re-roll that re-trips the same wall. Render machine-readable binding_critiques back into the redraft context — turns the existing loop into a real LLM-Modulo loop with no graph-topology change. A natural extension: deterministic optimal-control over the FF model proposes a feasible load scaffold the LLM only narrates (Busso & Thomas 2006 lineage).
-
New citation kind feasibility_envelope (preferred over simulated_trajectory) — records model_id, parameter_band, binding_critique; reproducible from (canonical_state, prescription, model_id, band). Add alongside existing kinds metric / user_request / name / url.
Coordination with #10
#10's entailment verifier makes prescriptions worse if applied naively — nothing future is entailed by a record of the past, so a stronger descriptive verifier scrubs progressive plans harder. #10 and this epic must share the PRESCRIPTION claim type as the seam so they compose. Prescriptions need different verification semantics, not more descriptive rigor.
Verification principle
Verify adherence to coaching principles (progressive overload, recovery/supercompensation, specificity, timeline) — not predicted outcomes. A taper week is feasible because the recovery principle demands it before an event, not "wrong because it is below 7×CTL."
Eval (plan goldens the current suites can't express)
- A build-plan case asserting simulation-feasible progressive targets survive grounding.
- A taper case asserting a prescribed load is never rewritten upward.
- A strength-goal case asserting the agent degrades honestly when no fitted model applies, rather than flattening to maintenance.
- An ambiguity case asserting in-band uncertainty yields
DEGRADE (stated), not CONTRADICTED.
Spec-first
Define the PRESCRIPTION claim type, the registry interface, and the feasibility_envelope citation contract in the grounding + forward-model specs before code, so the new citation kind is normative and the agent can cite it.
Open research questions (carried from #25 debate)
- Which parameter priors/bands per sport, and how to narrow them from sparse individual history without overfitting (the identifiability trap).
- Minimum viable model set for v1 (proposal: endurance-event default + rule-based floor; strength = explicit "not yet verifiable").
- How to surface residual uncertainty to the athlete in the approval surface without eroding trust.
Derivative of #25. That issue establishes the type split (
NUMBERvsPRESCRIPTION) and the near-term safety slice (decision-aware gate + no in-place rewrite of prescriptions). This issue is the research-gated epic: how we actually verify a prescription once we stop rewriting it.The framing decided in #25 (discussion): falsify, don't predict. The science says a single point-prediction simulator is not achievable — the fitness-fatigue model is statistically ill-conditioned with poor parameter identifiability and overfitting fatigue terms (Sci Rep 2025), and Banister himself recommended re-fitting every 60–90 days. So the verifier's job is refutation against a typed feasibility envelope, with stated residual uncertainty — never a promise of outcome. This keeps the "model never self-certifies" invariant and breaks the recursion (a sound critic need not be as capable as the planner — LLM-Modulo, Kambhampati et al., ICML 2024).
Goals
Feasibility-model registry, keyed by
(sport, goal_type), each model implementing one interface:(sport, goal_type)→ abstain/degrade honestly ("can't verify a plan for this yet"), never flatten to 7×CTL maintenance.GoalTypeis a closed enum (EVENT,TARGET_METRIC,DISTANCE,PROCESS,OTHER,domain/enums.py:339);sportis the data-driven registry (persistence/models/athlete.py). The registry key is(sport, goal_type).Banded endurance verifier (first concrete model). Reuse the existing deterministic, seedable
AnalyticsService.pmc(...)(analytics/pmc.py,service.py:315) as the forward integrator. Simulate CTL/ATL/TSB across a plausible parameter band, not a point estimate;CONTRADICTEDonly when infeasible across the whole band; in-band ambiguity →DEGRADEwith uncertainty stated.Safety envelope as review flags, not injury predictions — and explicitly NOT ACWR. ACWR is discredited as a causal/predictive injury metric (Impellizzeri et al. 2020, PMID 32502973; 2024 review); the persisted
daily_wellness.acwrcolumn (persistence/models/wellness.py:126, read by nothing) stays orphaned. Use instead:target_date).Intensity anchoring off canonical thresholds — we already have
CRITICAL_POWER_W,W_PRIME_J,FitnessSignature.ftp_w. Verify per-session targets as deterministic functions of CP/FTP (zone derivations), extending the closed derivation table the 7×CTL precedent legitimized.Strength / hypertrophy is a different verifier, not the same simulator. It is not an impulse-response phenomenon — model it as volume landmarks (MEV→MAV→MRV) + progressive overload per the ACSM progression-model position stand 2009 and the RP volume-landmark framework. This is the clearest reason the registry must support heterogeneous models behind one interface rather than one universal simulator.
Close the generate-test loop. Per-claim grounding critiques are computed in
groundand discarded beforecompose(graph_state.pyrender_context), so REGENERATE is a blind re-roll that re-trips the same wall. Render machine-readablebinding_critiques back into the redraft context — turns the existing loop into a real LLM-Modulo loop with no graph-topology change. A natural extension: deterministic optimal-control over the FF model proposes a feasible load scaffold the LLM only narrates (Busso & Thomas 2006 lineage).New citation kind
feasibility_envelope(preferred oversimulated_trajectory) — recordsmodel_id,parameter_band,binding_critique; reproducible from(canonical_state, prescription, model_id, band). Add alongside existing kindsmetric/user_request/name/url.Coordination with #10
#10's entailment verifier makes prescriptions worse if applied naively — nothing future is entailed by a record of the past, so a stronger descriptive verifier scrubs progressive plans harder. #10 and this epic must share the
PRESCRIPTIONclaim type as the seam so they compose. Prescriptions need different verification semantics, not more descriptive rigor.Verification principle
Verify adherence to coaching principles (progressive overload, recovery/supercompensation, specificity, timeline) — not predicted outcomes. A taper week is feasible because the recovery principle demands it before an event, not "wrong because it is below 7×CTL."
Eval (plan goldens the current suites can't express)
DEGRADE(stated), notCONTRADICTED.Spec-first
Define the
PRESCRIPTIONclaim type, the registry interface, and thefeasibility_envelopecitation contract in the grounding + forward-model specs before code, so the new citation kind is normative and the agent can cite it.Open research questions (carried from #25 debate)