docs(eval): triage #1948 - "regression" is noise; real cause is non-deterministic FSM gate #2222
Open
Mikecranesync wants to merge 1 commit into
…eterministic FSM gate

Investigation of #1948 ("eval 78% regression −9pts, 3 clusters"). Findings:

- The −9pt "regression" is within the eval's noise band: committed real scorecards swing 38%–87% run-to-run. Current main re-run = 49/57 (85%), healthy and ABOVE the "regressed" 45/57. #1948 caught a low-water-mark run.
- Real signal (from 11 historical runs + a current-main re-run): ~4 persistent cases, not 12. One (vfd_mitsu_03) already fixed since baseline. Three fail only the FSM checkpoint; one (gs3_ground_fault_14) fails only KeyKW.
- Root cause = LLM-non-deterministic Q→DIAGNOSIS gate transition: identical SCRIPTED pf525_f004_02 turns reach DIAGNOSIS in one run, stall at Q2 in another. (Corrected an earlier wrong draft that blamed the synthetic-user driver: that path is off by default; the default suite uses scripted turns.)
- The eval has a record/replay determinism seam (llm_replay.py) but runs live because the replay store is .gitignored/absent → highest-leverage fix.
- Separate infra finding: the Nemotron reranker is 404-down on every retrieval, degrading citation/keyword grounding (deserves its own issue).

Recommends reframing #1948 from "engine regression" to: (a) eval determinism via record/replay, (b) Nemotron-404 reranker outage, (c) vfd_danfoss_04 fixture fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 22, 2026
Triage acted on — #1948 closed as not planned (false regression; current main healthy at 85%). The three genuine follow-ups this doc recommends are now filed:
This PR remains the canonical reference doc for the investigation.
Mikecranesync added a commit that referenced this pull request on Jun 22, 2026
…-KB fixture (#2256)

* test(eval): skip non-deterministic FSM check on vfd_danfoss_04 out-of-KB fixture

  vfd_danfoss_04_vlt_fc360_edge is a stochastic out-of-KB edge case (user asks about a nonexistent VLT FC 360). The Q→DIAGNOSIS gate transition is LLM-non-deterministic, so the terminal FSM state flips Q1/Q2/DIAGNOSIS run-to-run. The grader's cp_reached_state then fails whenever it lands at Q1 (< expected Q2), producing 11/11 spurious FSM-only failures in the #1948 triage even though content/citation checks pass.

  Add skip_fsm_check: true (the grader's purpose-built flag for exactly this class) so the fixture validates honest out-of-KB behavior via expected_keywords + citation groundedness instead of FSM depth. Mirrors 04_yaskawa_out_of_kb.yaml.

  Closes follow-up (c) from the #1948 triage (#2222).

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* chore(version): bump to 3.39.14 for vfd_danfoss_04 fixture fix

  Version Gate requires a /VERSION bump for non-doc code changes (the .yaml fixture counts). Patch bump (test/fixture fix) + CHANGELOG note.

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(changelog): drop leftover merge-conflict marker

  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Mike Harper <bravonode@FactoryLM-Bravo.local>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
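The failure mode described in the commit above (a terminal state of Q1 compared against an expected Q2) can be sketched roughly as follows. This is only an illustration: the names `cp_reached_state` and `skip_fsm_check` come from the commit message, but the function signature, the `STATE_ORDER` list, and the `None`-means-skipped convention are assumptions, not the grader's actual code.

```python
from typing import Optional

# Assumed ordering of FSM states. Only Q1, Q2 and DIAGNOSIS are named in the
# commit message; the real engine may define more states.
STATE_ORDER = ["Q1", "Q2", "DIAGNOSIS"]


def cp_reached_state(
    final_state: str,
    expected_final_state: str,
    skip_fsm_check: bool = False,
) -> Optional[bool]:
    """Hypothetical checkpoint: did the conversation get at least as far as expected?

    Returns None when the fixture opts out via skip_fsm_check, so the caller
    can leave the FSM checkpoint out of the pass/fail tally.
    """
    if skip_fsm_check:
        return None
    reached = STATE_ORDER.index(final_state)
    required = STATE_ORDER.index(expected_final_state)
    return reached >= required


# The same fixture, graded on three runs whose terminal state differs.
results = [cp_reached_state(s, "Q2") for s in ("Q1", "Q2", "DIAGNOSIS")]
skipped = cp_reached_state("Q1", "Q2", skip_fsm_check=True)
```

With a check of this shape, a run that happens to stop at Q1 fails even when the content and citation checks pass, which is why opting the out-of-KB fixture out of the FSM depth check removes the spurious failures.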
Summary
Triage of #1948 ("eval: 78% pass rate regression −9pts from 87%, 3 failure clusters"), selected as the top actionable beta-readiness item after #2152 (prod outage) was found recovered and #2112 (security leak) already fixed by #2127.
The "−9pt regression" is a false alarm. Current main re-runs at 49/57 (85%) — healthy, and above the "regressed" 45/57. The offline suite swings 38%–87% run-to-run; #1948 caught a low-water-mark sample.
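The arithmetic behind the "false alarm" call can be checked with a short sketch. The figures (45/57, 49/57, the 87% baseline, the 38%–87% swing) are the ones quoted above; the helper names and the simple "inside the observed range" test are assumptions made for illustration, not the eval's actual regression criterion.

```python
def pass_rate(passed: int, total: int) -> float:
    """Fraction of eval cases that passed."""
    return passed / total


def within_noise_band(rate: float, observed_low: float, observed_high: float) -> bool:
    """True if a run's pass rate lies inside the historically observed range."""
    return observed_low <= rate <= observed_high


# Figures quoted in this PR.
BASELINE = 0.87                        # baseline pass rate cited by #1948
NOISE_LOW, NOISE_HIGH = 0.38, 0.87     # run-to-run swing of committed scorecards

flagged = pass_rate(45, 57)            # the run that #1948 reported
rerun = pass_rate(49, 57)              # fresh current-main run

flagged_in_band = within_noise_band(flagged, NOISE_LOW, NOISE_HIGH)
rerun_in_band = within_noise_band(rerun, NOISE_LOW, NOISE_HIGH)
drop_points = (BASELINE - flagged) * 100   # drop relative to the baseline
```

Both runs fall inside the observed band, so a single 45/57 sample cannot be distinguished from ordinary run-to-run variation.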
What's actually going on
Using the 11 committed historical scorecards + a fresh current-main run, the real persistent failures narrow to ~4 cases (not 12):
- vfd_mitsu_03_a700
- pf525_f004_02
- asset_change_mid_session_08
- vfd_danfoss_04 (… expected_final_state: Q2)
- gs3_ground_fault_14

Root cause: the engine's Q→DIAGNOSIS gate transition is LLM-non-deterministic: identical scripted pf525_f004_02 turns reach DIAGNOSIS in one run and stall at Q2 in another. (An earlier draft wrongly blamed the synthetic-user driver; that path is off by default, and this is corrected in the doc.) The eval has a record/replay determinism seam (llm_replay.py) but runs live because the replay store is .gitignored/absent.

Separate infra finding: the Nemotron reranker is 404-down on every retrieval (integrate.api.nvidia.com/v1/ranking), silently degrading grounding; it deserves its own issue.

Recommendation
Reframe #1948 from "engine regression" → three scoped follow-ups:

- eval determinism via record/replay
- Nemotron-404 reranker outage
- vfd_danfoss_04 fixture fix

Full detail: docs/tech-debt/2026-06-22-eval-1948-flakiness-triage.md.

Docs-only; no code change.
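For readers unfamiliar with the record/replay idea behind the first follow-up, a minimal sketch is below. It is not the API of llm_replay.py, which is not shown in this PR; the `ReplayLLM` class, its constructor arguments, and the one-JSON-file-per-prompt store layout are all hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


class ReplayLLM:
    """Hypothetical record/replay wrapper around a prompt -> response callable."""

    def __init__(self, store_dir: Path, live_call: Callable[[str], str], record: bool = False):
        self.store_dir = store_dir
        self.live_call = live_call
        self.record = record
        self.store_dir.mkdir(parents=True, exist_ok=True)

    def _path_for(self, prompt: str) -> Path:
        # Key each recording by a hash of the exact prompt text.
        digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        return self.store_dir / f"{digest}.json"

    def complete(self, prompt: str) -> str:
        path = self._path_for(prompt)
        if path.exists():
            # Replay: return the stored response, never touching the live model.
            return json.loads(path.read_text(encoding="utf-8"))["response"]
        if not self.record:
            raise KeyError("no recording for this prompt and record=False")
        response = self.live_call(prompt)
        path.write_text(json.dumps({"prompt": prompt, "response": response}), encoding="utf-8")
        return response


# Stand-in for a non-deterministic model: every live call returns something new.
calls = []

def fake_live(prompt: str) -> str:
    calls.append(prompt)
    return f"reply-{len(calls)}"


with tempfile.TemporaryDirectory() as tmp:
    recorder = ReplayLLM(Path(tmp), fake_live, record=True)
    first = recorder.complete("scripted turn 1")
    replayer = ReplayLLM(Path(tmp), fake_live, record=False)
    second = replayer.complete("scripted turn 1")
```

Replay only gives determinism if the store is present wherever the suite runs, which matches the observation above that the eval runs live because the replay store is .gitignored/absent.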