docs(eval): triage #1948 — "regression" is noise; real cause is non-deterministic FSM gate by Mikecranesync · Pull Request #2222 · Mikecranesync/MIRA

Mikecranesync · 2026-06-22T05:05:26Z

Summary

Triage of #1948 ("eval: 78% pass rate regression −9pts from 87%, 3 failure clusters"), selected as the top actionable beta-readiness item after #2152 (prod outage) was found recovered and #2112 (security leak) already fixed by #2127.

The "−9pt regression" is a false alarm. A re-run of current main scores 49/57 (85%), which is healthy and above the "regressed" 45/57 (78%). The offline suite swings between 38% and 87% from run to run, so #1948 caught a low-water-mark sample.
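The noise-band argument can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the repo: the helper names are hypothetical, and only the band extremes (38% and 87%) and the two run scores quoted above are used, since the individual scorecards are not reproduced here.

```python
TOTAL_CASES = 57  # suite size quoted in the triage


def pass_rate(passed: int, total: int = TOTAL_CASES) -> float:
    """Fraction of cases that passed in one run."""
    return passed / total


def within_noise_band(rate: float, band_low: float, band_high: float) -> bool:
    """True when a run's pass rate falls inside the observed run-to-run band."""
    return band_low <= rate <= band_high


# Observed extremes from the committed scorecards, as quoted in the triage.
BAND_LOW, BAND_HIGH = 0.38, 0.87

flagged = pass_rate(45)  # the run reported in #1948
rerun = pass_rate(49)    # the fresh current-main run

print(f"flagged run: {flagged:.1%}, in band: {within_noise_band(flagged, BAND_LOW, BAND_HIGH)}")
print(f"re-run:      {rerun:.1%}, in band: {within_noise_band(rerun, BAND_LOW, BAND_HIGH)}")
```

Both runs fall inside the band, so a single run at 45/57 cannot be distinguished from ordinary run-to-run variation.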

What's actually going on

Using the 11 committed historical scorecards + a fresh current-main run, the real persistent failures narrow to ~4 cases (not 12):

| Case | Now | Real issue |
| --- | --- | --- |
| `vfd_mitsu_03_a700` | ✓ PASS | already fixed since baseline |
| `pf525_f004_02` | ✗ FSM only | ends Q2 not DIAGNOSIS (content otherwise good) |
| `asset_change_mid_session_08` | ✗ FSM only | ends Q1 |
| `vfd_danfoss_04` | ✗ FSM only | fixture/grader mismatch (`expected_final_state: Q2`) |
| `gs3_ground_fault_14` | ✗ KeyKW only | citation/keyword miss (Nemotron-404 aggravated) |
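Separating persistent failures from flaky ones across several scorecards could be sketched as follows. The input shape (one dict per run, mapping case id to the set of failed checkpoints) and the function name are assumptions for illustration. The values in `TOY_RUNS` are made up to exercise the function and are not the real scorecard data.

```python
from collections import Counter


def persistent_failures(runs, min_fraction=1.0):
    """Return case ids that fail in at least `min_fraction` of the runs.

    `runs` is a list of dicts mapping case id -> set of failed checkpoint
    names; an empty set means the case passed. This shape is hypothetical.
    """
    counts = Counter()
    for run in runs:
        for case_id, failed in run.items():
            if failed:
                counts[case_id] += 1
    threshold = min_fraction * len(runs)
    return sorted(case for case, n in counts.items() if n >= threshold)


# Toy data only: case names come from the table, results are invented.
TOY_RUNS = [
    {"pf525_f004_02": {"FSM"}, "vfd_danfoss_04": {"FSM"},
     "gs3_ground_fault_14": {"KeyKW"}, "vfd_mitsu_03_a700": set()},
    {"pf525_f004_02": set(), "vfd_danfoss_04": {"FSM"},
     "gs3_ground_fault_14": {"KeyKW"}, "vfd_mitsu_03_a700": set()},
    {"pf525_f004_02": {"FSM"}, "vfd_danfoss_04": {"FSM"},
     "gs3_ground_fault_14": {"KeyKW"}, "vfd_mitsu_03_a700": set()},
]

print(persistent_failures(TOY_RUNS))                    # fails in every run
print(persistent_failures(TOY_RUNS, min_fraction=0.5))  # fails in at least half
```

Lowering `min_fraction` widens the result from always-failing cases to cases that fail often, which is one way to surface a flaky case that passes in some runs and fails in others.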

Root cause: the engine's Q→DIAGNOSIS gate transition depends on non-deterministic LLM output. Identical scripted `pf525_f004_02` turns reach DIAGNOSIS in one run and stall at Q2 in another. (An earlier draft wrongly blamed the synthetic-user driver; that path is off by default, and the doc has been corrected.) The eval has a record/replay determinism seam (`llm_replay.py`), but it runs live because the replay store is .gitignored or absent.
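For readers unfamiliar with the pattern, a record/replay seam generally looks like the sketch below. This is a generic illustration, not the contents of `llm_replay.py`: the `ReplayStore` class, its modes, and the on-disk layout are all assumptions.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


class ReplayStore:
    """Cache LLM responses on disk, keyed by a hash of the request (hypothetical)."""

    def __init__(self, root: Path, mode: str = "replay"):
        if mode not in {"record", "replay"}:
            raise ValueError(f"unknown mode: {mode}")
        self.root = root
        self.mode = mode
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, request: dict) -> Path:
        # Canonical JSON so that key order does not change the hash.
        blob = json.dumps(request, sort_keys=True).encode("utf-8")
        return self.root / f"{hashlib.sha256(blob).hexdigest()}.json"

    def call(self, request: dict, live_llm: Callable[[dict], str]) -> str:
        path = self._path(request)
        if path.exists():
            # A recorded response exists: return it without calling the model.
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        if self.mode == "replay":
            # Fail loudly instead of silently falling back to a live call.
            raise KeyError(f"no recorded response for request {path.stem[:12]}")
        response = live_llm(request)
        path.write_text(
            json.dumps({"request": request, "response": response}),
            encoding="utf-8",
        )
        return response
```

In replay mode the same scripted turns always see the same model output, so the gate transition becomes reproducible. The cost is that the store has to be committed or otherwise provisioned, which is the gap the triage identifies.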

Separate infra finding: the Nemotron reranker endpoint (integrate.api.nvidia.com/v1/ranking) returns 404 on every retrieval, which silently degrades grounding. This deserves its own issue.

Recommendation

Reframe #1948 from "engine regression" → three scoped follow-ups:

(a) eval determinism via record/replay (highest leverage — makes the suite able to detect a real regression)
(b) Nemotron-404 reranker outage (new infra issue)
(c) vfd_danfoss_04 fixture fix

Full detail: docs/tech-debt/2026-06-22-eval-1948-flakiness-triage.md.

Docs-only; no code change.

…eterministic FSM gate

Investigation of #1948 ("eval 78% regression −9pts, 3 clusters").

Findings:
- The −9pt "regression" is within the eval's noise band: committed real scorecards swing 38%–87% run-to-run. Current main re-run = 49/57 (85%), healthy and ABOVE the "regressed" 45/57. #1948 caught a low-water-mark run.
- Real signal (from 11 historical runs + a current-main re-run): ~4 persistent cases, not 12. One (vfd_mitsu_03) already fixed since baseline. Three fail only the FSM checkpoint; one (gs3_ground_fault_14) fails only KeyKW.
- Root cause = LLM-non-deterministic Q→DIAGNOSIS gate transition: identical SCRIPTED pf525_f004_02 turns reach DIAGNOSIS in one run, stall at Q2 in another. (Corrected an earlier wrong draft that blamed the synthetic-user driver: that path is off by default; the default suite uses scripted turns.)
- The eval has a record/replay determinism seam (llm_replay.py) but runs live because the replay store is .gitignored/absent → highest-leverage fix.
- Separate infra finding: the Nemotron reranker is 404-down on every retrieval, degrading citation/keyword grounding (deserves its own issue).

Recommends reframing #1948 from "engine regression" to: (a) eval determinism via record/replay, (b) Nemotron-404 reranker outage, (c) vfd_danfoss_04 fixture fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Mikecranesync · 2026-06-22T20:40:50Z

Triage acted on — #1948 closed as not planned (false regression; current main healthy at 85%). The three genuine follow-ups this doc recommends are now filed:

(a) eval-suite determinism via cascade record/replay → test(eval): make offline-text suite deterministic via cascade record/replay (FSM-gate flakiness) #2258 (highest leverage)
(b) Nemotron-404 reranker outage → infra: Nemotron reranker is 404-down on every retrieval — silently degrades grounding #2257
(c) vfd_danfoss_04 fixture skip_fsm_check → PR test(eval): skip non-deterministic FSM check on vfd_danfoss_04 out-of-KB fixture #2256

This PR remains the canonical reference doc for the investigation.

…-KB fixture (#2256)

* test(eval): skip non-deterministic FSM check on vfd_danfoss_04 out-of-KB fixture

  vfd_danfoss_04_vlt_fc360_edge is a stochastic out-of-KB edge case (user asks about a nonexistent VLT FC 360). The Q→DIAGNOSIS gate transition is LLM-non-deterministic, so the terminal FSM state flips Q1/Q2/DIAGNOSIS run-to-run. The grader's cp_reached_state then fails whenever it lands at Q1 (< expected Q2), producing 11/11 spurious FSM-only failures in the #1948 triage even though content/citation checks pass.

  Add skip_fsm_check: true (the grader's purpose-built flag for exactly this class) so the fixture validates honest out-of-KB behavior via expected_keywords + citation groundedness instead of FSM depth. Mirrors 04_yaskawa_out_of_kb.yaml.

  Closes follow-up (c) from the #1948 triage (#2222).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(version): bump to 3.39.14 for vfd_danfoss_04 fixture fix

  Version Gate requires a /VERSION bump for non-doc code changes (the .yaml fixture counts). Patch bump (test/fixture fix) + CHANGELOG note.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(changelog): drop leftover merge-conflict marker

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Mike Harper <bravonode@FactoryLM-Bravo.local>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
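The interaction between `cp_reached_state` and `skip_fsm_check` described in the commit message can be pictured roughly as below. This is a sketch under stated assumptions, not the grader's actual code: the three-state ordering is inferred from the states named in this PR (the real FSM may define more), and the function signature and fixture shape are hypothetical.

```python
# Hypothetical state ordering; the real FSM may define more states.
STATE_ORDER = ["Q1", "Q2", "DIAGNOSIS"]


def cp_reached_state(final_state: str, fixture: dict) -> bool:
    """Pass when the session ended at or beyond the fixture's expected state."""
    if fixture.get("skip_fsm_check", False):
        return True  # FSM depth is not asserted for this fixture
    expected = fixture["expected_final_state"]
    return STATE_ORDER.index(final_state) >= STATE_ORDER.index(expected)


strict = {"expected_final_state": "Q2"}
relaxed = {"expected_final_state": "Q2", "skip_fsm_check": True}

for state in STATE_ORDER:
    print(state, cp_reached_state(state, strict), cp_reached_state(state, relaxed))
```

With the strict fixture a run that ends at Q1 fails while Q2 and DIAGNOSIS pass, so the checkpoint result tracks the non-deterministic gate. With the flag set, the checkpoint passes regardless of the final state, and the fixture is judged on its keyword and citation checks instead.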

Mikecranesync temporarily deployed to staging June 22, 2026 05:05 — with GitHub Actions Inactive

Mikecranesync wants to merge 1 commit into main from docs/eval-1948-triage
